Storing 3 Billion Items in 16 GB of RAM (Java Map) (repost)


While discussing how to deduplicate data, someone suggested building a B-tree in a direct buffer. I figured there must be an existing solution, and found a good one:

MapDB: http://www.mapdb.org/

The following is from kotek.net: http://kotek.net/blog/3G_map

3 billion items in Java Map with 16 GB RAM

One rainy evening I meditated about memory management in Java and how effectively Java collections utilise memory. I ran a simple experiment: how many entries can I insert into a Java Map with 16 GB of RAM?

The goal of this experiment is to investigate the internal overhead of collections, so I decided to use small keys and small values. All tests were made on 64-bit Kubuntu Linux 12.04. The JVM was 64-bit Oracle Java 1.7.0_09-b05 with HotSpot 23.5-b02. There is an option to use compressed pointers (-XX:+UseCompressedOops), which is on by default on this JVM.

First is a naive test with java.util.TreeMap. It inserts numbers into the map until it runs out of memory and ends with an exception. The JVM setting for this test was -Xmx15G:

import java.util.*;

// Insert sequential numbers until the heap is exhausted, printing
// progress every million entries.
Map<Long, String> m = new TreeMap<>();
for (long counter = 0; ; counter++) {
  m.put(counter, "");
  if (counter % 1000000 == 0) System.out.println(counter);
}

This example ended at 172 million entries. Near the end, the insertion rate slowed down thanks to excessive GC activity. On a second run I replaced TreeMap with HashMap; it ended at 182 million. For scale, 15 GB spread over 172 million entries works out to roughly 90 bytes per entry, most of it collection overhead (object headers, node references, boxed Long keys) rather than payload.
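
The HashMap run is not shown in the original; a minimal sketch of that swap, assuming the same driver loop as above:

import java.util.*;

// Same loop as the TreeMap test, but hash-based: no ordering, and a
// different per-entry overhead (Entry objects plus the bucket array).
Map<Long, String> m = new HashMap<>();
for (long counter = 0; ; counter++) {
  m.put(counter, "");
  if (counter % 1000000 == 0) System.out.println(counter);
}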

Java's default collections are not the most memory-efficient option, so let's try a memory-optimized one. I chose LongHashMap from MapDB, which uses primitive long keys and is optimized to have a small memory footprint. The JVM setting is again -Xmx15G:

import org.mapdb.*;

// LongHashMap stores primitive long keys directly, avoiding the boxed
// Long objects that inflate the default collections.
LongMap m = new LongHashMap();
for (long counter = 0; ; counter++) {
  m.put(counter, "");
  if (counter % 1000000 == 0) System.out.println(counter);
}

This time the counter stopped at 276 million entries. Again, near the end the insertion rate slowed down thanks to excessive GC activity.
It looks like this is the limit for heap-based collections; garbage collection simply brings overhead.

Now it is time to pull out the big gun :-). We can always go off-heap, where the GC cannot see our data. Let me introduce you to MapDB; it provides a concurrent TreeMap and HashMap backed by a database engine. It supports various storage modes, one of which is off-heap memory. (Disclaimer: I am the MapDB author.)
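
For context, "off-heap" here means direct memory outside the Java heap, the kind java.nio exposes; a minimal illustration of the idea (not MapDB internals):

import java.nio.ByteBuffer;

// A direct buffer's contents live outside the Java heap, so the GC
// never scans them; it only tracks the small ByteBuffer wrapper object.
ByteBuffer offHeap = ByteBuffer.allocateDirect(64 * 1024 * 1024); // 64 MB
offHeap.putLong(0, 42L);          // write a long at byte offset 0
long value = offHeap.getLong(0);  // read it back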

So let's run the previous example, but now with an off-heap Map. First are a few lines to configure and open the database; this opens a direct-memory store with transactions disabled. The next line creates a new Map within the db.

import org.mapdb.*;

// Open a store backed by direct (off-heap) memory, with
// transactions disabled for speed.
DB db = DBMaker
   .newDirectMemoryDB()
   .transactionDisable()
   .make();

Map m = db.getTreeMap("test");
for (long counter = 0; ; counter++) {
  m.put(counter, "");
  if (counter % 1000000 == 0) System.out.println(counter);
}

This is an off-heap Map, so we need different JVM settings: -XX:MaxDirectMemorySize=15G -Xmx128M. This test runs out of memory at 980 million records.

But MapDB can do better. The problem in the previous sample is record fragmentation: a b-tree node changes its size on each insert. The workaround is to hold b-tree nodes in cache for a short moment before they are written, which reduces record fragmentation to a minimum. So let's change the DB configuration:

// Delay asynchronous writes by 100 ms so b-tree nodes settle in the
// cache before being persisted, minimizing record fragmentation.
DB db = DBMaker
     .newDirectMemoryDB()
     .transactionDisable()
     .asyncFlushDelay(100)
     .make();

Map m = db.getTreeMap("test");

This time it runs out of memory at 1,738 million records. The speed is just amazing: 1.7 billion items are inserted within 31 minutes, which works out to roughly 900,000 inserts per second.

MapDB can do even better. Let's increase the b-tree node size from 32 to 120 entries and enable transparent compression:

// Same configuration as before, plus transparent compression on top
// of the store.
DB db = DBMaker
     .newDirectMemoryDB()
     .transactionDisable()
     .asyncFlushDelay(100)
     .compressionEnable()
     .make();

// 120 = maximal b-tree node size (default is 32); the remaining
// arguments keep their defaults.
Map m = db.createTreeMap("test", 120, false, null, null, null);

This example runs out of memory at a whopping 3,315 million records. It is slower thanks to compression, but it still finishes within a few hours. I could probably make some optimizations (custom serializers etc.) and push the number of entries to somewhere around 4 billion.
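
To give a feel for the custom-serializer idea: every value in these tests is the empty string, so it could be encoded in zero bytes. A hypothetical sketch of the concept in plain Java; a real version would plug into MapDB's serializer interface, whose exact signature depends on the version:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

// Hypothetical codec: the value is always "", so nothing is written.
// Illustrates the concept only; not MapDB's actual interface.
public class EmptyStringCodec {
  public void serialize(DataOutput out, String value) throws IOException {
    // nothing to write: the value is known to be the empty string
  }

  public String deserialize(DataInput in, int available) throws IOException {
    return "";
  }
}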

Maybe you wonder how all those entries can fit there. The answer is delta-key compression. (UPDATE added later: there was a bit of confusion about compression. Delta-key compression is active by default in all the examples; in the previous example I activated additional zlib-style compression.) Also, inserting incremental keys (already ordered) into a B-Tree is the best-case scenario, and MapDB is slightly optimized for it. The worst-case scenario is inserting keys in random order:

import org.mapdb.*;
import java.util.*;

// Same direct-memory configuration as before, but keys now arrive
// in random order, the worst case for b-tree insertion.
DB db = DBMaker
        .newDirectMemoryDB()
        .transactionDisable()
        .asyncFlushDelay(100)
        .make();

Map m = db.getTreeMap("test");

Random r = new Random();
for (long counter = 0; ; counter++) {
    m.put(r.nextLong(), "");
    if (counter % 1000000 == 0) System.out.println(counter);
}

But even with random order, MapDB manages to store 651 million records, nearly 4 times more than the heap-based collections.
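
To see why ordered keys compress so much better than random ones, here is a toy illustration of delta-key encoding (plain Java, not MapDB's actual codec):

// Consecutive sorted keys differ by tiny amounts, so storing deltas
// instead of full 8-byte longs lets a variable-length encoding spend
// only 1-2 bytes per key. Random keys have huge deltas and gain nothing.
long[] sortedKeys = {1000000L, 1000001L, 1000002L, 1000007L};

long previous = 0;
for (long key : sortedKeys) {
  long delta = key - previous;  // small for ordered input
  System.out.println("store delta: " + delta);
  previous = key;
}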

This little exercise does not have much purpose; it is just one of many I do to optimize MapDB. Perhaps most amazing is that the insertion speed was actually very good, and that MapDB can compete with memory-based collections.


过年啦,写个短点的。同时,提前给大家拜个年。   总有小伙伴们跑过来讨论关于Span和Memory的使用,眼瞅是最近关于Span的文章有点多,看飞了。 今天写这个,就是往回拉一拉。 写之前,先声明一下。这些内容是我自己使用的一些经验,并不代表这些类的全部内容就是这些,只是说,我是这么用的,而且用得很好。 1. Span Span在我的概念中,就是一个快速的...