On the path to a better select

The current indexing implementation in my tyrdbs project is based on a hybrid of prefix trees and B-trees. It's messy, has significant space overhead and lacks some features I'd like to see, so I went looking for a suitable replacement. Last time around, I looked at succinct trees because of their low space overhead. In a nutshell, they are encoded as bit fields and navigated using rank/select primitives. This Wikipedia entry is a good starting point if you're interested in learning more about how they work. I eventually gave up on them because I had implemented the select primitive using binary search, which was too slow to use on a hot path. I've decided to give them another shot, but for that to happen I needed to figure out a way to make the select primitive faster.

Select primitive

The select primitive is simple: for a given bit field x = {x1, x2, x3, ..., xn}, select(k), where k ranges from 1 to the number of 1 bits in x, returns the position of the k'th 1 bit. This means an implementation has to map k values to the appropriate positions in the bit field. As it turns out, this is more difficult than it seems, especially once you add constraints on the size of your supporting structures. A simple approach is to store the bit field and binary search over it; a slightly better one is to keep some sort of supporting structure that guides the binary search. One step further, you replace the bit field with a specialized index altogether. The basic idea is to narrow the search area enough that, when n becomes sufficiently large, the remaining binary search is negligible, or a linear search or a lookup table becomes viable. There has been a lot of research in this area (I'll link some of the papers below) and there are many variations on these basic concepts. Most of them trade space for time, though, and for small trees (n = 2048) with a constrained time budget that is a deal breaker.
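For reference, here is a minimal sketch of the naive baseline (the helper name is mine): walk the bit field and count 1 bits until the k'th one is reached. The binary-search variants mentioned above do the same job, but search over precomputed counts of fixed-size chunks instead of walking bit by bit.

def select_naive(x, k):
    # scan from the least significant bit, counting 1 bits as we go
    position = 0
    while x:
        if x & 1:
            k -= 1
            if k == 0:
                return position
        x >>= 1
        position += 1
    raise ValueError("k exceeds the number of 1 bits")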

Broadword approach

This approach was introduced in a paper by Vigna [1] and revolves around the idea of using a broad word (64/128/256 bits) to represent the part of the bit field we've zeroed in on using the supporting structures, and computing the result with SIMD-style techniques. Processing 64 bits of the bit field this way is fast enough for my purposes (a bit over 2 ns). For a small, maxed-out tree (n = 2048), all we need is lg(n) * n / 64 bits of extra space to support a fast select primitive (lg(2048) = 11, so 11 * 2048 / 64 = 352 bits, or 44 bytes). After some googling, I found the author's blog, which touches on the subject and has a C implementation of the algorithm from the paper along with an improved version that's missing a lookup table, which can be found here. Vigna also linked to a paper by Gog and Petri [2] which describes a modified broadword algorithm. One thing the modified version introduced is a second table lookup that fetches words for correcting overflows. To work around this, line 8 can be replaced with:

b = ((128 - k) * 0x0101010101010101 + s) & 0x8080808080808080

without compromising the algorithm, while also improving cache locality.
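To see why this works: after the byte-wise popcount steps, each byte of s holds the running count of 1 bits up to and including that byte, and since those counts never exceed 64, adding (128 - k) to every byte produces no carries and pushes exactly the bytes whose running count has reached k past 0x80, so only their top bits survive the mask. A small sketch, with a helper name and example values of my own:

def bytes_where_prefix_reaches_k(s, k):
    return ((128 - k) * 0x0101010101010101 + s) & 0x8080808080808080

# running counts 2, 5, 7, 7, ... packed one per byte, k = 4:
# only the bytes holding counts >= 4 end up with their high bit set
hex(bytes_where_prefix_reaches_k(0x0707070707070502, 4))  # '0x8080808080808000'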

This is the full listing of the algorithm in Python, along with the code that generates the lookup table (transcribing it to other languages should be easy enough; note that int.bit_count() requires Python 3.10 or newer):

# values dictionary maps each byte to the (0-based) positions of its 1 bits:
# 0x00: []
# 0x01: [0]
# 0x02: [1]
# 0x03: [0, 1]
# 0x04: [2]
# 0x05: [0, 2]
# 0x06: [1, 2]
# 0x07: [0, 1, 2]
# 0x08: [3]
# 0x09: [0, 3]
# ...

values = {}

for i in range(0, 256):
    values[i] = []

    for j in range(0, 8):
        if i & (1 << j):
            values[i].append(j)

def select(x, k):
    # x is treated as a 64-bit word, k is 1-based
    assert k >= 1
    assert k <= x.bit_count()

    # per-byte popcount of x (the usual SWAR steps)
    s = x - ((x >> 1) & 0x5555555555555555)
    s = (s & 0x3333333333333333) + ((s >> 2) & 0x3333333333333333)
    s = (s + (s >> 4)) & 0x0f0f0f0f0f0f0f0f

    # byte j of s now holds the number of 1 bits in bytes 0..j of x
    s = s * 0x0101010101010101

    # set the high bit of every byte whose running count has reached k
    b = ((128 - k) * 0x0101010101010101 + s) & 0x8080808080808080

    # bit offset of the byte containing the k'th 1 bit
    block = (8 - b.bit_count()) << 3

    # how many 1 bits into that byte we still have to go (0-based)
    offset = (k - (((s << 8) >> block) & 0xff)) - 1

    # extract the byte and finish with the lookup table
    byte = (x >> block) & 0xff

    v = block + values[byte][offset]

    return v
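A quick sanity check with a hand-made bit field (the example values are mine): the 1 bits of 0b1011010010 sit at positions 1, 4, 6, 7 and 9, counting from the least significant bit.

x = 0b1011010010

print(select(x, 1))  # 1
print(select(x, 3))  # 6
print(select(x, 5))  # 9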

Conclusion

This algorithm is really fast and can be used on hot paths, and it covers a wide enough range to help cut down the size of the supporting structures. There's still room for improvement, for example handling more than 64 bits at a time by implementing it with SSE, but for now this is enough.

References

[1] Vigna, S. (2008). Broadword implementation of rank/select queries. Experimental Algorithms. Lecture Notes in Computer Science. Vol. 5038. pp. 154–168. CiteSeerX 10.1.1.649.8950. doi:10.1007/978-3-540-68552-4_12. ISBN 978-3-540-68548-7.

[2] Gog, S. and Petri, M. (2014). Optimized succinct data structures for massive data. Software: Practice and Experience. Vol. 44. pp. 1287–1314.

[3] David R. Clark. Compact Pat Trees. PhD thesis, University of Waterloo, 1996.

[4] Guy Joseph Jacobson. Succinct static data structures. PhD thesis, Carnegie Mellon University, Pittsburgh, PA, USA, 1988. AAI8918056.

[5] Huanchen Zhang, Hyeontaek Lim, Viktor Leis, David G. Andersen, Michael Kaminsky, Kimberly Keeton, and Andrew Pavlo. 2018. SuRF: Practical Range Query Filtering with Fast Succinct Tries. In Proceedings of the 2018 International Conference on Management of Data (SIGMOD '18). Association for Computing Machinery, New York, NY, USA, 323–336.

[6] Vigna, S. Blog: vigna.di.unimi.it/Sux/select.php

[7] Giuseppe Ottaviano's Succinct Data Structures Library: github.com/ot/succinct

[8] Simon Gog's Succinct Data Structure Library: https://github.com/simongog/sdsl-lite

[9] tyrdbs Database Toolkit Library: https://git.sr.ht/~tyrtech/tyrdbs