Database
Management

Enumeration

Enumeration is a way to encode symbols so that they take less storage space.
For example, given this list of symbols `AAPL`GOOG`AAPL`MSFT`GOOG`AAPL`GOOG`AAPL`MSFT`GOOG
The actual domain is just three symbols `AAPL`GOOG`MSFT
Without enumeration, each symbol takes 8 bytes, and the list above will take 80 bytes.
With enumeration, we can encode the list as a list of numeric values. Note that 1 byte can represent 256 different values, which is far more than what we need to encode 3 symbols, so we can represent everything in our domain with only 1 byte. The list will only take 10 bytes with enumeration.

There are two ways to create enumeration in KDB: To illustrate the two methods:

            q) stocks:`AAPL`GOOG`AAPL`MSFT`GOOG`AAPL`GOOG`AAPL`MSFT`GOOG
            q) domainStocks: distinct stocks
            q) show enumStocks:domainStocks$stocks
            `s$`AAPL`GOOG`AAPL`MSFT`GOOG`AAPL`GOOG`AAPL`MSFT`GOOG

            // alternatively, use ? and provide an empty domain, then let it update itself
            // the backtick before the domain ` is essential
            q) `createDomain?`stocks
            `s$`AAPL`GOOG`AAPL`MSFT`GOOG`AAPL`GOOG`AAPL`MSFT`GOOG
            q) createDomain
            `AAPL`GOOG`MSFT
        
Several things to note here: Two ways to unenumerate a list: get or value

            q) enumStocks
            `s$`AAPL`GOOG`AAPL`MSFT`GOOG`AAPL`GOOG`AAPL`MSFT`GOOG
            q) unenumeratedStocks: get enumStocks
            q) unenumeratedStocks2: value enumStocks
        
Advanced enumeration:
The sym file in a database is essentially all symbols enumerated (not just from sym column, but all symbol columns). There are helper functions in the .Q namespace to deal with this file, such as .Q.en.

Foreign Keys

The idea is similar to relational databases where you have foreign keys to reference primary keys in other tables.
Say you have two tables trades and sector, with their meta look like this:

            q) meta trades
            c     | t f       a
            -------------------
            sym   | s sector
            price | f
            size  | f

            q) meta sector
            c     | t f a
            -------------------
            symb  | s
            sector| s
        
Here in trades there is a foreign key that links to sector. The sector table is keyed on symb.
For every sym in trades, it must appear in the symb column in sector.
If you upsert a record to trades with a sym that is unknown in sector, you will get an error.
Once you set it up correctly, you will be able to directly query the sector column from trades without having to join with the sector table.
e.g. select from trades where sym.sector=`tech

Compression

Tables are save compressed to reduce space on disk.
Three keywords here: Two ways to do compression in KDB: What do the integers in -19! and .z.zd mean?

Database


            // taken from https://code.kx.com/q/database/segment/ with some minor edits
            DISK 0             DISK 1                    DISK 2  
            db                 db                        db             
            ├── sym            ├── sym                   ├── sym
            └── par.txt        ├── 2020.10.03            ├── 2020.10.04                         
                               │   ├── quotes            │   ├── quotes                         
                               │   │   ├── .d            │   │   ├── .d
                               │   │   ├── sym           │   │   ├── sym                          
                               │   │   ├── price         │   │   ├── price                            
                               │   │   └── time          │   │   └── time                           
                               │   └── trades            │   └── trades                         
                               │       ├── .d            │       ├── .d
                               │       ├── sym           │       ├── sym                          
                               │       ├── price         │       ├── price                            
                               │       ├── time          │       ├── time                           
                               │       └── vol           │       └── vol                          
                               ├── 2020.10.05            ├── 2020.10.06                         
                               │   ├── quotes            │   ├── quotes      
        

Overview

The image above demonstrates KDB architecture.
If you look at the entire structure, it is a segmented database, suitable for scalability and parallel access.
If you only look at Disk 1, it is a partitioned dtabase, suitable for more than 100 million rows, or steadily growing data. But if the database grows large enough, since all partitions share I/O, it can become a bottleneck, and is better to use a segmented database.
If you only look at one table such as quotes, it is a splayed table, stores up to 100 million rows.
The smallest unit is a flat table, which stores up to a few million rows.

Some important files you need to pay attention to:

Flat tables

For flat tables, see here.

Splayed tables

Saving splayed tables

Saving a splayed table without symbols is just very slightly different from saving a flat table.

            // just make sure the path ends with a slash
            q) `:db/tbl/ set tbl
        
But if the table contains symbols, we need to enumerate, otherwise there will be a type error.
Two ways to do it:

            q) `:db/tbl/ set .Q.en[`:db/; tbl]
            q) `:db/tbl/ set .Q.ens[`:db/; tbl; symFile]
        
The only difference is that .Q.ens let you use a custom symbol file instead of the default one located at the root level of the database.

Loading splayed tables

Two ways to load splayed tables: either use get or just \l the table.
What's happening under the hood isn't really "loading" the data into memory, but "mapping" the data on disk into memory, meaning we only loaded in the "structure" of the table and where to find the data on disk.

            q) tbl: get `:db/tbl
            q) \l db/tbl
        

Querying splayed tables

If you only want to select or sort the table, you don't even have to load it. You can directly operate on the splayed table as if it is in memory. But you won't be able to exec or update it.

            q) select from `:db/tbl where col1=1        // this is okay
            q) delete col2 from `:db/tbl                // this will error
        

Updating splayed tables

To update a splayed table, you need to consider three things: Here are some examples:

            // only update rows but with symbols
            q) `:db/tbl/ upsert .Q.en[`:db/tbl;]([]sym:`AAPL`GOOG;col1:100 200)
            
            // add a new column
            q) `:db/tbl/newCol set 100 200 300                  // after this step, you won't be able to see the column yet
            q) `:db/tbl/.d set get[`:db/tbl/.d],`newCol         // now you can see the column
            // delete a column
            q) system "rm db/tbl/newCol"                        // again, after this step, it's not done yet
            q) `:db/tbl/.d set get[`:db/tbl/.d] except `newCol  // now it's done
        

Partitioned tables

A partitioned database is essentially a collection of splayed tables.
They can be partitioned by dates, or other stuff like months. But one database can only be partitioned by one thing.
Once you decide how to partition it, KDB will add a virtual column corresponding to the data type you choose. For example, if you choose to partition by date, it will add a virtual date column.

Saving partitioned tables

Similar to what we do with splayed tables, but requires one extra step to find the actual path based on the date.

            q) enumTable: .Q.en[`:db/; tbl]
            q) dt: first exec distinct date from tbl            // e.g, 2020.01.01
            q) path: ` sv `:db/,(`$string dt),`tbl`             // get the path `:db/2020.01.01/tbl/
            q) path set select from enumTable where date=dt
        
Above only saves one date. To partition the entire table, you can do something like this:

            saveTableParted:{[dir;t;tableName]
                dts: exec distinct date from t;
                enumTable: .Q.en[dir; t];
                {[dir;tab;tableName;dt]
                    toSave: select from enumTable where date=dt;
                    path: ` sv dir, (`$string dt), tableName,`;
                    path set toSave;
                }[dir;enumTable;tableName] each dts
            }
        

Loading, querying and updating partitioned tables

Loading a partitioned table is the same as loading a splayed table.
Similar to splayed table, you can't exec or update a partitioned table.
To update a partitioned table is similar to updating a splayed table. You need to consider the same three things.

Sorting large partitions

For large partitioned table, it might not be a good idea to sort the entire table. Instead we sort one column and then use the sorted index to sort other columns.

            // function to perform sorting
            indexSort:{[par;col;ordering]
                @[hsym par; col; @[;ordering]];
                // how to understand this function: say ordering is [2 0 3 1], and col is `quantity (20 40 10 30)
                // @[;ordering] is a partial function waiting for a list input
                // so @[;ordering] on `quantity is the same as (20 40 10 30)[2 0 3 1], which will return 10 20 30 40
                // then we use @[hsym par; col; ...] to write this new value back to the column on disk
            }

            // using the function
            q) sortedIndex: iasc get `:/db/2020.01.01/trades/time                   // get the sorted index
            q) indexSort[`:db/2020.01.01/trades; sortedIndex] each colsOnDisk       // sort each column using the sorted index
            q) @[partition; `time; `s#]                                             // apply sorted attribute to the time column
        

Useful utility functions

There are many useful utility functions in the .Q namespace for working with partitioned tables. Another very handy thing is dbmaint. This script is VERY, VERY useful.

Segmented database

Similar to what we did for partitioned tables to find correct path of a partition, we need to do one extra step for segmented database to find the correct path of the disk mount.

Saving to segmented database

The idea is that, data should be distributed as evenly across all mounts as possible to ensure best performance.
One vanilla way to do this is to use the mod operator to determine which mount to use.
Using an example to illustrate: if we want to find out which disk does 2016.01.01 go to, and we have 4 mounts, we do 2016.01.01 mod 4.
This gives us 0 so this partition should go to the first mount.
In general, to save to a segmented database, we do something like this:

            saveSegDB:{[dt; tabName]
                mounts: read0 `:par.txt;                                    // read all mounts
                numMounts: count mounts;
                allocateTo: mounts dt mod numMounts;                        // determine which mount to use, e.g., `:/disk1/db1
                path: ` sv hsym[`$allocateTo], (`$string dt), tabName;      // construct the path, e.g., `:/disk1/db1/2016.01.01/trades/
                path set tabName;
            }
        
If you update a segmented database with a new mount, you need to update par.txt manually.

Loading from segmented database

For segmented database, you almost never load the entire database into memory.
All you need is actually just the par.txt file which contains the location of all partitions.
This can be done by simply loading the root of the database, e.g., \l segDBRoot
And you will be able to query the database as if it is in memory.

Useful utility functions