Research
Networked Memory Architecture
2017 - 2020
As storage and network technologies evolve rapidly, CPU performance remains comparatively stagnant as Moore’s law slows. As a result, a CPU running heavyweight storage software can easily become the bottleneck. We tackle this problem from several angles.
- At the OS level, we challenge the conventional wisdom of strictly separating user and kernel spaces by introducing a kernel-userspace collaboration architecture (Kuco[FAST’21]), which enables direct storage access with minimal software overhead; a minimal sketch of this collaboration follows this list.
- We also extend the use of NVM to distributed environments by introducing RDMA-enabled persistent distributed shared memory (pDSM), which eliminates redundant memory copies (Octopus[USENIX ATC’17, TOS’20]).
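To make the collaboration idea concrete, here is a minimal C++ sketch under stated assumptions: the userspace library reads data directly from a mapped NVM region, while metadata updates are posted to a trusted kernel thread through a shared request ring. All names here (MetaRequest, RequestRing, read_direct) are hypothetical illustrations, not Kuco’s actual interface.

```cpp
// Hypothetical sketch: userspace performs the data path directly on
// mapped NVM (no syscall), and delegates metadata updates to a kernel
// thread via a shared single-producer/single-consumer request ring.
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <cstring>

struct MetaRequest { uint64_t inode; uint64_t new_size; };

struct RequestRing {                        // shared user/kernel memory
    static constexpr size_t kSlots = 256;
    MetaRequest slots[kSlots];
    std::atomic<uint64_t> head{0}, tail{0}; // kernel thread advances head

    bool post(const MetaRequest& r) {       // userspace: enqueue metadata op
        uint64_t t = tail.load(std::memory_order_relaxed);
        if (t - head.load(std::memory_order_acquire) == kSlots)
            return false;                   // ring full; caller retries
        slots[t % kSlots] = r;
        tail.store(t + 1, std::memory_order_release);
        return true;
    }
};

// Data path: a plain copy from the mapped NVM region, with no kernel
// crossing on the critical path.
void read_direct(const char* nvm_base, uint64_t off, void* buf, size_t len) {
    std::memcpy(buf, nvm_base + off, len);
}
```

The point this illustrates is that only metadata, which needs protection, crosses the trust boundary; bulk data never does.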
CPU-efficient IO Engine
2020 - 2024
Purely reducing the overhead of storage software is still not enough; system designers must also be device-aware, since emerging hardware typically exhibits counterintuitive performance behavior. For example, NVM has asymmetric read/write performance, device-level IO amplification, and performance variability, while RDMA shows limited scalability due to device-level cache thrashing. In this context, I have designed:
- an asynchronous IO framework to hide NVM’s high access latency (EasyIO[EuroSys’24])
- a key-value store that uses a compacted log to mitigate IO amplification (FlatStore[ASPLOS’20]); see the sketch after this list
- and an RPC system to enable RDMA to work at a larger scale (ScaleRPC[EuroSys’19])
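As a rough illustration of the compacted-log idea, the C++ sketch below stages small put() records in a DRAM buffer and flushes the whole batch to the persistent log with one sequential NVM write, so many tiny writes amortize to a single flush. CompactedLog and persist() are hypothetical stand-ins (persist() abstracts a cache-line write-back plus fence), not FlatStore’s actual API.

```cpp
// Illustrative compacted log: many small records are staged in DRAM and
// persisted to NVM as one batch, amortizing the device-level write cost.
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

struct LogEntry { uint64_t key; uint32_t len; /* value bytes follow */ };

class CompactedLog {
    char* nvm_log_;            // assumed to be a mapped NVM region
    size_t tail_ = 0;          // persistent log tail
    std::vector<char> batch_;  // DRAM staging buffer

public:
    explicit CompactedLog(char* nvm) : nvm_log_(nvm) {}

    void put(uint64_t key, const void* val, uint32_t len) {
        LogEntry e{key, len};  // stage header and value back to back
        batch_.insert(batch_.end(), (char*)&e, (char*)&e + sizeof(e));
        batch_.insert(batch_.end(), (const char*)val, (const char*)val + len);
    }

    void flush() {             // one sequential NVM write per batch
        if (batch_.empty()) return;
        std::memcpy(nvm_log_ + tail_, batch_.data(), batch_.size());
        persist(nvm_log_ + tail_, batch_.size());   // e.g., clwb + sfence
        tail_ += batch_.size();
        batch_.clear();
    }

private:
    static void persist(const void*, size_t) { /* write-back + fence here */ }
};
```

Batching this way trades a short window of volatility in DRAM for far fewer device-level writes, which is what mitigates NVM’s IO amplification.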
Low Tail Latency Concurrency Control
2020 - 2024
Apart from seeking higher throughput and lower latency, datacenter applications also require their performance to be predictable (often defined by 99th- or 99.9th-percentile latency). Latency variability can arise for many reasons, including resource sharing (e.g., CPU cores, caches, and memory bandwidth), background activities, and queuing. In the past years, we have witnessed an active line of research that improves performance predictability at different layers, but it ignores the fact that the workload itself is another source of latency spikes, owing to request conflicts. Here, I take a much deeper dive into concurrency-protocol design with this workload-aware principle in mind.
- coordinated concurrency control for tree-based index structures (uTree[VLDB’20])
- pessimistic locking and opportunistic reading for transactional systems (Plor[SIGMOD’22]); a sketch of this read/write pattern follows
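The general pattern behind pessimistic locking with opportunistic reading can be sketched as a seqlock-style record: writers take a lock up front, while readers snapshot the value without blocking and validate a version afterwards. This C++ sketch illustrates the pattern only; Plor’s actual protocol is considerably more involved.

```cpp
// Illustrative seqlock-style record: writers lock pessimistically and
// bump a version around the update; readers read opportunistically and
// validate the version instead of blocking. Not Plor's actual protocol.
#include <atomic>
#include <cstdint>
#include <mutex>

struct Record {
    std::mutex write_lock;               // pessimistic: writers queue here
    std::atomic<uint64_t> version{0};    // even = stable, odd = in flight
    std::atomic<uint64_t> payload{0};

    void write(uint64_t v) {
        std::lock_guard<std::mutex> g(write_lock);
        version.fetch_add(1, std::memory_order_acq_rel);  // mark odd
        payload.store(v, std::memory_order_relaxed);
        version.fetch_add(1, std::memory_order_release);  // even again
    }

    // Returns false if a writer interfered; the caller simply retries,
    // so readers never sleep on the lock.
    bool opportunistic_read(uint64_t& out) {
        uint64_t v1 = version.load(std::memory_order_acquire);
        if (v1 & 1) return false;                  // write in progress
        out = payload.load(std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_acquire);
        return version.load(std::memory_order_relaxed) == v1;
    }
};
```

Because readers never wait on the lock, a burst of conflicting writes costs a read only a retry rather than a place in a lock queue, which is the property that tames tail latency.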