At each time t = 1, 2, …, n, a new file of length l(t) arrives, along with a read rate r(t). An online k-compaction algorithm must maintain a collection of at most k files, choosing at each time t, without knowledge of future inputs, some of the current files to merge into one. Each merge incurs a merge cost equal to the total length of the merged files, and each time step incurs a read cost equal to the read rate r(t) times the number of files present at time t. The goal is to minimize the total cost over time. Such k-compaction algorithms are a key component of log-structured merge trees, the file-based data structure underlying NoSQL databases such as Accumulo, Bigtable, Cassandra, HBase, and others. We initiate the theoretical study of k-compaction algorithms: we formalize the problem, consider worst-case, average-case, and competitive analysis (per-instance optimality), and propose new algorithms that are optimal under these metrics.
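To make the cost model concrete, here is a minimal Python sketch of the simulation it describes; the names (simulate, merge_all_if_full) and the naive baseline policy are illustrative assumptions, not the algorithms proposed in this work, and we assume the read cost at time t is charged after any merge at time t.

```python
# A minimal sketch of the k-compaction cost model (assumption: the
# read cost at time t is charged after any merge performed at time t).

def simulate(inputs, k, choose_merge):
    """Run an online k-compaction policy and return its total cost.

    inputs: sequence of (l, r) pairs -- file length l(t), read rate r(t)
    k: maximum number of files allowed at any time
    choose_merge: policy mapping the current file lengths to the indices
                  to merge into one file (online: it sees no future input)
    """
    files, total_cost = [], 0
    for l, r in inputs:
        files.append(l)                       # new file arrives at time t
        merge_set = set(choose_merge(files, k))
        if merge_set:
            merged = sum(files[i] for i in merge_set)
            total_cost += merged              # merge cost: total merged length
            files = [f for i, f in enumerate(files) if i not in merge_set]
            files.append(merged)
        assert len(files) <= k, "policy must keep at most k files"
        total_cost += r * len(files)          # read cost: r(t) * #files present
    return total_cost

# Hypothetical baseline policy: merge everything whenever the
# collection would otherwise exceed k files (purely for illustration).
def merge_all_if_full(files, k):
    return range(len(files)) if len(files) > k else []

print(simulate([(3, 1), (1, 2), (4, 1), (1, 5)], k=2,
               choose_merge=merge_all_if_full))   # prints 24
```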

This is joint work with Carl Staelin, Neal E. Young, and Arman Yousefi.