I think this community is the right place to start a conversation about NUMA (aren't NUMA nodes to memory what multiprocessors are to processing? ;). I apologize if this is considered off-topic.
We are developing a Java in-memory analytical database (it's called "ActivePivot") that our customers deploy on ever larger datasets. Some ActivePivot instances are deployed on Java heaps close to 1 TB, on NUMA servers (typically 4 Xeon processors and 4 NUMA nodes). This is becoming a trend, and we are researching solutions to improve our performance on NUMA configurations.
We understand that in the current state of things (up to and including JDK 8), NUMA support in HotSpot is as follows:
* The young generation heap layout can be NUMA-aware (partitioned per NUMA node, with objects allocated on the same node as the running thread)
* The old generation heap layout is not optimized for NUMA (at best the old generation is interleaved among nodes, which at least makes memory accesses somewhat uniform)
* The parallel garbage collector is NUMA-optimized, with each GC thread focusing on objects in its own node.
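For reference, this is roughly how we activate the NUMA-aware young generation with the parallel collector (a sketch: the heap sizes and jar name below are placeholders, not our actual deployment):

```shell
# -XX:+UseNUMA partitions the young generation (eden) per NUMA node so that
# objects are allocated on the node of the allocating thread.
# -XX:+UseNUMAInterleaving interleaves the rest of the heap across nodes.
# Heap sizes and activepivot-server.jar are illustrative placeholders.
java -Xms768g -Xmx768g \
     -XX:+UseParallelGC \
     -XX:+UseNUMA \
     -XX:+UseNUMAInterleaving \
     -jar activepivot-server.jar
```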
Yet activating the -XX:+UseNUMA option has almost no impact on the performance of our in-memory database. This is not surprising: the access pattern for a database is to load the data into memory and then run queries against it. The data is promoted to the old generation and stays there, and queries read it from there. Most memory accesses hit the old gen, and most of those are not local.
I guess there is a reason HotSpot does not yet optimize the old generation for NUMA. It must be very difficult to do in the general case, when you have no idea which thread on which node will read which data, and interleaving is the safest default. But for an in-memory database this is frustrating, because we know very well which threads will access which piece of data. In ActivePivot, at least, data structures are partitioned and each partition is assigned its own thread pool, so the threads that allocated the data in a partition are also the threads that perform sub-queries on that partition. We are a few lines of code away from binding thread pools to NUMA nodes, and if the garbage collector left objects promoted to the old generation on their original NUMA node, memory accesses would be close to optimal.
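To make the partition-affinity pattern concrete, here is a minimal sketch: each partition owns a single-threaded executor, so the thread that loads a partition's data is also the thread that later queries it. The actual binding of each thread to a NUMA node (via numactl or a native call) is omitted, and all class and method names here are illustrative, not ActivePivot's real API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/**
 * Illustrative sketch (not ActivePivot code): data is split into partitions,
 * and each partition is served by its own single-threaded executor. All
 * allocation and all reads for partition i happen on thread i, so if the old
 * generation kept promoted objects on their original NUMA node, each
 * partition's data would stay local to its thread's node.
 */
public class PartitionedStore {
    private final ExecutorService[] pools;
    private final List<List<long[]>> partitions;

    public PartitionedStore(int partitionCount) {
        pools = new ExecutorService[partitionCount];
        partitions = new ArrayList<>();
        for (int i = 0; i < partitionCount; i++) {
            // One dedicated thread per partition. Binding this thread to a
            // NUMA node (e.g. through a native library) is left out here.
            pools[i] = Executors.newSingleThreadExecutor();
            partitions.add(new ArrayList<>());
        }
    }

    /** Load a chunk of data into a partition, on that partition's own thread. */
    public Future<?> load(int partition, long[] chunk) {
        return pools[partition].submit(() -> partitions.get(partition).add(chunk));
    }

    /** Run a sub-query (here, a simple sum) on the partition's own thread. */
    public Future<Long> sum(int partition) {
        return pools[partition].submit(() -> {
            long total = 0;
            for (long[] chunk : partitions.get(partition)) {
                for (long v : chunk) {
                    total += v;
                }
            }
            return total;
        });
    }

    public void shutdown() {
        for (ExecutorService pool : pools) {
            pool.shutdown();
        }
    }
}
```

The point of the single-threaded pools is the invariant, not parallelism: because loads and sub-queries for a partition are serialized on one thread, the allocating thread and the reading thread are guaranteed to be the same, which is exactly the property a NUMA-local old generation could exploit.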
* Are there hidden or experimental HotSpot options that enable NUMA-aware partitioning of the old generation?
* Do you know why there isn't much (visible, generally available) research on NUMA optimizations for the old gen? Is the Java in-memory database use case considered a rare one?
* Maybe we should experiment with, and even contribute, new heap layouts to the OpenJDK project. Can some of you comment on the difficulty of that?
Thanks for reading,
Director Research & Development