Are You Sure You Want to Use MMAP in Your Database Management System?
I’m starting a series covering papers from CIDR 2022, and the first one I’m covering is the paper out of CMU (video) on why using mmap in databases is more challenging than you might think.
Covering this paper is quite interesting to me for two reasons:
I look at quite a few database and related startups from an investor POV, and quite often you see mmap being used in some part of the system. It’s usually part of the discussion I have with founders, especially those who haven’t built datastores in production at scale, who often underestimate by a few orders of magnitude how difficult it is to implement a new one.
One of my typical system design interview questions, back when I was hiring engineers, was to design a key-value store under specific constraints, and quite often mmap shows up as part of the solution as well.
Quoting directly from the paper, “mmap and DBMSs are like coffee and spicy food: an unfortunate combination that becomes obvious after the fact.”
What is MMAP?
For those unfamiliar with mmap, it’s a system call on Linux (and other POSIX systems) that maps file content on disk into memory without spending I/O up front to load it. This lets programmers simply read and write data in memory and have the OS handle the I/O in the background as needed. There are also potential performance benefits, since going through the OS page cache directly avoids a user-space copy of the data.
Many databases started their v1 implementations using mmap as the main storage layer, e.g. InfluxDB, SingleStore, QuestDB, etc., while many of them have since moved away after hitting various problems.
(Source: CIDR 2022 video)
What’s wrong with mmap?
If you choose to use mmap for your database, you will hit the following four problems:
Because the OS can choose to flush memory-mapped pages to disk at any time under memory pressure, the database could end up writing dirty, in-progress data to disk before it’s fully committed.
Developers hitting this problem with mmap have to implement a WAL (write-ahead log) with multiple copies of the data, either in user space or in another mmaped file, which duplicates data and adds a lot more complex logic.
Another related problem stemming from the OS evicting pages is that the database no longer knows which parts of the memory-mapped file are actually resident in memory. This causes unpredictable stalls when reading from mmaped memory, and none of the workarounds on top (mlock, madvise) are an easy fix.
Databases typically verify checksums to ensure data integrity, and this becomes hard with a mmaped file as the OS might evict and sync pages to disk at any point. This essentially means the database needs to validate the checksum on every access.
Error handling is also harder with mmaped memory: instead of getting ordinary I/O error codes when reading from disk, the database receives signals (such as SIGBUS) that are harder to manage.
The most significant problem is actually performance. Even though accessing a mmaped file avoids explicit read/write system calls and leverages the page cache, an OS like Linux isn’t designed for many threads/cores accessing the same mmap region concurrently, which causes issues like TLB shootdowns. In the paper’s evaluation, performance gets significantly worse (2-20x), especially at the point where memory fills up.
The following is the paper’s parting advice →
In my simpler layman’s terms: use mmap if you’re hacking together a v0/v1 product or a small in-memory DB, but otherwise stay away from it.
This is also a good illustration of why building a production-ready database is hard: there are lots of details involved in making a database that doesn’t lose data while still having the performance and features to stand out.