Current Research
Operating System Design for Disaggregated Memory Architecture
- an asynchronous IO engine to hide disaggregated memory’s high access latency (EasyIO[EuroSys’24]); see the latency-hiding sketch after this list
- a pure userspace process abstraction to enforce accurate memory bandwidth allocation (Vessel[SOSP’24])
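To make the latency-hiding idea concrete, below is a minimal C++ sketch. It models a disaggregated-memory access with a hypothetical remote_read() stub (the ~3 µs delay is an assumed round-trip figure), then contrasts a synchronous loop, which stalls for every fetch, with an engine that issues all fetches asynchronously and harvests completions afterward. It illustrates the general technique only, not EasyIO’s design.

```cpp
#include <chrono>
#include <cstdio>
#include <future>
#include <thread>
#include <vector>

// Hypothetical stand-in for a fetch from disaggregated memory; the
// ~3 us delay is an assumed round-trip figure, not a measured one.
int remote_read(int addr) {
    std::this_thread::sleep_for(std::chrono::microseconds(3));
    return addr * 2;  // fake payload
}

long long elapsed_us(std::chrono::steady_clock::time_point t0) {
    return std::chrono::duration_cast<std::chrono::microseconds>(
               std::chrono::steady_clock::now() - t0).count();
}

int main() {
    constexpr int kRequests = 64;

    // Synchronous baseline: the CPU stalls for the full round trip
    // of every single request, so latencies add up.
    auto t0 = std::chrono::steady_clock::now();
    long long sync_sum = 0;
    for (int i = 0; i < kRequests; i++) sync_sum += remote_read(i);
    long long sync_us = elapsed_us(t0);

    // Asynchronous engine: issue every fetch up front, then harvest
    // completions, so the round trips overlap instead of serializing.
    t0 = std::chrono::steady_clock::now();
    std::vector<std::future<int>> inflight;
    for (int i = 0; i < kRequests; i++)
        inflight.push_back(std::async(std::launch::async, remote_read, i));
    long long async_sum = 0;
    for (auto& f : inflight) async_sum += f.get();
    long long async_us = elapsed_us(t0);

    std::printf("sync: %lld us, async: %lld us (checksums %lld/%lld)\n",
                sync_us, async_us, sync_sum, async_sum);
}
```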
(Storage) Systems Support for Large Language Models
- KV cache management for LLM inference (HCache[EuroSys’25]); see the cache-pool sketch after this list
- Serverless LLM cold start optimization (Medusa[ASPLOS’25])
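As a rough illustration of KV cache management, the sketch below keeps KV blocks in a fixed-capacity pool with LRU eviction, keyed by hypothetical "sequence:layer" strings. The KVCachePool class, block granularity, and key layout are assumptions for illustration, not HCache’s actual design; a real serving system would also spill evicted blocks to slower tiers rather than discard them.

```cpp
#include <cstdio>
#include <list>
#include <string>
#include <unordered_map>

// Illustrative fixed-capacity pool of KV cache blocks with LRU
// eviction; capacity is counted in blocks for simplicity.
class KVCachePool {
public:
    explicit KVCachePool(size_t capacity_blocks) : cap_(capacity_blocks) {}

    // Look up the cached KV block for (sequence, layer); refresh LRU on hit.
    bool lookup(const std::string& key) {
        auto it = index_.find(key);
        if (it == index_.end()) return false;
        lru_.splice(lru_.begin(), lru_, it->second);  // move to front
        return true;
    }

    // Insert a block, evicting the least recently used one when full.
    // A real system would spill the victim to host DRAM or SSD.
    void insert(const std::string& key) {
        if (lookup(key)) return;
        if (lru_.size() == cap_) {
            index_.erase(lru_.back());
            lru_.pop_back();
        }
        lru_.push_front(key);
        index_[key] = lru_.begin();
    }

private:
    size_t cap_;
    std::list<std::string> lru_;  // front = most recently used
    std::unordered_map<std::string, std::list<std::string>::iterator> index_;
};

int main() {
    KVCachePool pool(2);
    pool.insert("seq0:layer0");
    pool.insert("seq0:layer1");
    pool.insert("seq1:layer0");  // evicts seq0:layer0
    std::printf("seq0:layer0 cached? %d\n", pool.lookup("seq0:layer0"));  // 0
    std::printf("seq0:layer1 cached? %d\n", pool.lookup("seq0:layer1"));  // 1
}
```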
Past Research Directions
Networked Memory Architecture (2017 - 2020)
While storage and network technologies have evolved rapidly, CPU performance has stagnated in comparison as Moore’s law slows. As a result, a CPU running heavyweight storage software easily becomes the bottleneck. We tackled this problem from several angles.
- A kernel-user space collaborative architecture for scalable filesystem designs (Kuco[FAST’21])
- RDMA-enabled distributed persistent shared memory (DPSM) (Octopus[USENIX ATC’17, TOS’20])
- an RDMA-based RPC system with scalability and reliability (ScaleRPC[EuroSys’19]); see the connection-grouping sketch below
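One way to keep RDMA RPC scalable, in the spirit of ScaleRPC’s connection grouping, is to partition client connections into small groups and time-share the groups, so that only a cache-friendly number of connections is active in any slice and the NIC’s per-connection state is not thrashed. The sketch below renders only the scheduling idea; the group size, round count, and data structures are assumptions, and no actual RDMA verbs appear.

```cpp
#include <cstdio>
#include <vector>

int main() {
    constexpr int kClients = 8;
    constexpr int kGroupSize = 2;  // assumed NIC-cache-friendly group size
    constexpr int kRounds = 2;

    // Partition client connection IDs into groups of kGroupSize.
    std::vector<std::vector<int>> groups;
    for (int c = 0; c < kClients; c += kGroupSize) {
        std::vector<int> g;
        for (int i = c; i < c + kGroupSize && i < kClients; i++)
            g.push_back(i);
        groups.push_back(g);
    }

    // Serve one group per time slice; within a slice only kGroupSize
    // connections are active, so their NIC state stays cached.
    for (int r = 0; r < kRounds; r++)
        for (size_t g = 0; g < groups.size(); g++) {
            std::printf("slice: group %zu serves clients", g);
            for (int c : groups[g]) std::printf(" %d", c);
            std::printf("\n");
        }
}
```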
CPU-efficient IO Engine (2020 - 2024)
Purely reducing the overhead of storage software is still not enough; system designers must also be device-aware, since emerging hardware typically exhibits counterintuitive performance behavior. For example, non-volatile storage devices have asymmetric read/write performance, device-level IO amplification, and performance variability. In this context, we designed:
- a holistic IO stack design for computational storage devices (λ-IO[FAST’23])
- a key-value store that uses a compacted log to mitigate IO amplification (FlatStore[ASPLOS’20]); see the batching sketch below
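The compacted-log idea behind the FlatStore bullet can be sketched in a few lines: pack many small puts into one buffer and persist them with a single device write, so per-operation IO amplification is amortized across the batch. The CompactedLog class, its record format, and the 64-byte batch size below are illustrative assumptions, not FlatStore’s actual interface.

```cpp
#include <cstdio>
#include <string>

class CompactedLog {
public:
    explicit CompactedLog(size_t batch_bytes) : cap_(batch_bytes) {}

    // Append a small record; flush only when the batch is full, so many
    // puts share one device write instead of paying one write each.
    void put(const std::string& key, const std::string& val) {
        std::string rec = key + "=" + val + ";";
        if (buf_.size() + rec.size() > cap_) flush();
        buf_ += rec;
        pending_++;
    }

    // Persist the whole batch with a single (simulated) device write.
    void flush() {
        if (buf_.empty()) return;
        std::printf("1 device write persists %zu ops (%zu bytes)\n",
                    pending_, buf_.size());
        writes_++;
        buf_.clear();
        pending_ = 0;
    }

    size_t device_writes() const { return writes_; }

private:
    size_t cap_, pending_ = 0, writes_ = 0;
    std::string buf_;
};

int main() {
    CompactedLog log(64);
    for (int i = 0; i < 10; i++)
        log.put("k" + std::to_string(i), "v" + std::to_string(i));
    log.flush();
    std::printf("total device writes: %zu for 10 puts\n", log.device_writes());
}
```

Run as written, the ten puts reach the device in a single write; a real NVM log would additionally handle crash consistency and garbage collection.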
Low Tail Latency Concurrency Control (2020 - 2024)
Apart from seeking higher throughput and lower latency, datacenter applications also require predictable performance, often defined by 99th- or 99.9th-percentile latency. Latency variability arises for many reasons, including shared resources (e.g., CPU cores, caches, and memory bandwidth), background activities, and queuing. Recent years have seen an active line of research that improves performance predictability at different layers, but this work overlooks another source of latency spikes: the workload itself, through request conflicts. Here, I take a deeper dive into concurrency protocol design with this workload-aware principle in mind.
- coordinated concurrency control for tree-based index structures (uTree[VLDB’20])
- pessimistic locking and opportunistic reading for transactional systems (Plor[SIGMOD’22]); see the seqlock-style sketch below
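To give a flavor of pairing pessimistic writes with opportunistic reads, the sketch below guards a record with a writer lock plus a sequence-lock-style version counter: writers serialize through the lock and bump the version around their updates, while readers never block, instead validating that the version did not change during their snapshot and retrying on conflict. This is a generic seqlock illustration under my own assumptions (default seq_cst atomics for simplicity), not Plor’s actual protocol.

```cpp
#include <atomic>
#include <cstdint>
#include <cstdio>
#include <mutex>
#include <thread>

struct Record {
    std::mutex wlock;              // writers serialize pessimistically
    std::atomic<uint64_t> ver{0};  // odd while a write is in flight
    std::atomic<int> a{0}, b{0};   // invariant after each write: a == b
};

// Writer: take the lock, bump the version to odd, mutate, bump to even.
void write_rec(Record& r, int v) {
    std::lock_guard<std::mutex> g(r.wlock);
    r.ver++;   // odd: update in progress
    r.a = v;
    r.b = v;
    r.ver++;   // even: record is stable again
}

// Reader: never blocks; snapshot the fields, then validate that the
// version is even and unchanged, retrying if a writer raced with us.
int read_rec(Record& r) {
    for (;;) {
        uint64_t v0 = r.ver;
        if (v0 & 1) continue;                  // writer active, retry
        int a = r.a, b = r.b;
        if (r.ver == v0 && a == b) return a;   // consistent snapshot
    }
}

int main() {
    Record r;
    std::thread w([&] { for (int i = 1; i <= 1000; i++) write_rec(r, i); });
    std::thread rd([&] {
        int last = 0;
        for (int i = 0; i < 1000; i++) last = read_rec(r);
        std::printf("reader last observed %d\n", last);
    });
    w.join();
    rd.join();
}
```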