LRN

About

This document covers storing the LRN database in redis.

Storing the entire LRN database in redis

This is an article written by avimarcus about storing the entire LRN DB, not about caching entries.

Reasons

Why not MySQL?

  • Inserting the LRN data into MySQL and creating a basic key (index) was very fast (under 30 minutes). However, applying diffs took a significant amount of time.
  • Applying diffs is faster with a unique key, but building that key likewise took a long time.
  • Importing diffs - 210 of them at the time of writing - took over 48 hours.
  • Worse, with MyISAM, importing diffs seemed to lock the table against reads.
  • Using InnoDB with row-level locking seemed to take much, much longer for the initial import.

Why not PostgreSQL?

  • This author is not very familiar with PostgreSQL but did attempt it. The main issue was that COPY from the file would hit a corrupt line and abort, so diff import times never even got tested.

Why not memcache?

  • Memcache supposedly evicts keys randomly even if the memory isn't needed. See #2 in this article.
  • Also, there's no persistence in memcache.

Why redis?

  • redis doesn't evict keys.
  • redis has a pretty good persistence method.
  • redis will update keys quickly, since it effectively has a unique index (in SQL terms) on each key.
  • redis won't lock out requests when updating keys.

Wait, so why didn't you start with redis?

  • redis keeps everything in memory.
  • Normal key->value pairs have a large overhead that makes each entry about 90 bytes.
  • 480 million records * 90 bytes ≈ 40.2 gigabytes -- that's quite a chunk (see the quick check below).
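A quick sanity check of that estimate in Node.js (treating a gigabyte here as 1024^3 bytes, which is how the figure above works out):

  // Rough memory estimate for storing every number as its own key.
  const RECORDS = 480e6;        // ~480 million records
  const BYTES_PER_KEY = 90;     // measured overhead of a plain SET key
  const GiB = 1024 ** 3;
  console.log((RECORDS * BYTES_PER_KEY / GiB).toFixed(1) + ' GiB'); // ~40.2 GiB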

Method

"Method, or How I got storage for 480million records to under 7.5gigabytes"

The redis website suggests a memory optimization method of batching many entries under a single key.

So instead of storing 3 keys:

2011111111 -> 7323330000 (SET 2011111111 7323330000)
2011111112 -> 7323330000 (SET 2011111112 7323330000)
2011111113 -> 7182220000 (SET 2011111113 7182220000)
(retrieve: GET 2011111113)
each object: encoding:int serializedlength:11
Total Usage: 400 bytes (3 keys)

Store them in chunks of 100, in a HASH, e.g.

20111111 -> 11 -> 7323330000 (HSET 20111111 11 7323330000)
-> 12 -> 7323330000 (HSET 20111111 12 7323330000)
-> 13 -> 7182220000 (HSET 20111111 13 7182220000)
(retrieve: HGET 20111111 13)
debug object 20111111 encoding:zipmap serializedlength:41
Total Usage: 128 bytes (3 entries in 1 key)
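The split is purely positional: the first 8 digits of the 10-digit number become the hash key and the last 2 digits become the field. A minimal Node.js sketch of that mapping (the helper name is illustrative, not from the article):

  // Map a 10-digit number to its hash key (first 8 digits) and field (last 2).
  function lrnKey(tn) {
    const digits = String(tn);
    return { key: digits.slice(0, 8), field: digits.slice(8) };
  }

  const { key, field } = lrnKey('2011111113');
  console.log(`HGET ${key} ${field}`);   // -> HGET 20111111 13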

Here's data from a much larger sample:

Results on a sample dump of 6,060,574 numbers:

  • Stored straight up, "SET key val"
    • Memory usage of 91 bytes each.
  • Stored in chunks of 100: 148,489 keys, approx 40 per key.
    • Memory usage: 17.2 bytes per stored number pair.
  • Stored in chunks of 1000: 26,535 keys, approx 228 per key.
    • Memory usage: 50.2 bytes each -- with default settings.
    • Memory usage: 17 bytes each -- with a config of "hash-max-ziplist-entries 1000" (shown below). Only a ~1.1% savings over chunks of 100.
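For reference, that setting lives in redis.conf (older redis versions that still report the zipmap encoding spell it hash-max-zipmap-entries); the relevant line is simply:

  hash-max-ziplist-entries 1000
  # values here are 10-digit strings, well under the default hash-max-ziplist-value
  # of 64 bytes, so only the entry count needs raising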

There are several ways batching like this saves space (a quick arithmetic check follows the list):

  1. Instead of the overhead of storing LRU and volatility metadata on each key, it's stored once for roughly every 50 entries. (On average it's about 50 entries per 100-number grouping, since not all numbers in a range have an LRN entry.)
  2. Instead of storing the first 8 of the 10 digits 50 times, they get stored once. That's a savings of 8 bytes * 49/50 * 480 million ≈ 3.5 gigabytes.
  3. Assumption: npanxxyyyy doesn't fit in a 4-byte INT; it requires an 8-byte INT. However, npanxxyy does fit in a 4-byte INT, and the last yy fits into a 1-byte INT. So we went from 8 bytes to 5 bytes: 3 bytes * 480 million ≈ 1.34 gigabytes.
  4. Supposition: since the overall usage goes down to ~17 bytes per number (despite zipmap overhead?), I think the zipmap is further compressing them. Since in each batch many numbers are LRN'd to the same destination, it's perhaps compressing that, too.
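A back-of-the-envelope check of the estimates in items 2 and 3 (again treating a gigabyte as 1024^3 bytes):

  const RECORDS = 480e6;
  const GiB = 1024 ** 3;
  // Item 2: the shared 8-digit prefix is stored once instead of ~50 times.
  console.log((8 * (49 / 50) * RECORDS / GiB).toFixed(2) + ' GiB'); // ~3.50 GiB
  // Item 3: dropping from an 8-byte to a 5-byte representation per number.
  console.log((3 * RECORDS / GiB).toFixed(2) + ' GiB');             // ~1.34 GiB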

Import Speed

  • I wrote a Node.js script to split the lines up and send the commands directly over TCP to redis (using a Node.js net connection writable stream); a sketch follows this list.
    • (Using the native redis protocol format didn't seem to help, since the bottleneck was TCP streaming and the native format is longer. Simply use the plain "HSET ....." command format.)
  • Node.js ran at 100% CPU and redis at 80% CPU, importing around 190k keys/second, so a full import should take under 45 minutes.
  • Initial import speed isn't so crucial - once the data is imported, you don't need to import again until the next full dump. If you do the import on another machine, you can then reload from redis' dump.rdb in 1-3 minutes, depending on the speed of the machine.
  • Each diff takes just seconds to load and doesn't block the redis DB. (Redis forks and saves a new RDB file if you have that set up.)
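Here is a minimal sketch of that kind of loader, assuming a plain-text dump with one "number,lrn" pair per line. The file name, line format, and variable names are illustrative rather than taken from the article; the real dump layout differs, so adjust the parsing accordingly.

  const net = require('net');
  const fs = require('fs');
  const readline = require('readline');

  // Raw TCP connection to redis; commands are written in the plain inline
  // "HSET key field value" format, as described above.
  const sock = net.connect(6379, '127.0.0.1');
  sock.on('data', () => {});   // discard the :0/:1 replies so they don't pile up

  const rl = readline.createInterface({ input: fs.createReadStream('lrn_dump.txt') });

  rl.on('line', (line) => {
    const [num, lrn] = line.trim().split(',');
    if (!num || !lrn || num.length !== 10) return;   // skip corrupt or blank lines
    const key = num.slice(0, 8);    // first 8 digits -> hash key
    const field = num.slice(8);     // last 2 digits  -> field
    if (!sock.write(`HSET ${key} ${field} ${lrn}\r\n`)) {
      rl.pause();                              // respect socket backpressure
      sock.once('drain', () => rl.resume());
    }
  });

  rl.on('close', () => sock.end());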

Persistence

  • Storage to the dump.rdb file is ~2.6 GB, with compression enabled in redis.conf (see the example settings below).
  • It takes under 3 minutes (155-165 seconds) to cold-start redis and load the persistence file into RAM [on the machine I tested]. Note that redis will not serve any requests until the file has finished loading.
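For reference, a minimal persistence block in redis.conf might look like the following; these are the stock redis defaults, shown as an illustration rather than the exact settings used here.

  save 900 1          # snapshot if at least 1 key changed in 15 minutes
  save 300 10         # ... or 10 keys in 5 minutes
  save 60 10000       # ... or 10000 keys in 1 minute
  rdbcompression yes  # compress dump.rdb (the ~2.6 GB figure above assumes this)
  dbfilename dump.rdb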