How to speed up MongoDB Inserts/sec?

I'm trying to maximize inserts per second. I currently get around 20k inserts/sec. My performance is actually degrading the more threads and CPU I use (I have 16 cores available). 2 threads currently do more per sec than 16 threads on a 16 core dual processor machine. Any ideas on what the problem is? Is it because I'm using only one mongod? Is it indexing that could be slowing things down? Do I need to use sharding? I wonder if there's a way to shard, but also keep the database capped...

Constraints: must handle around 300k inserts/sec, must be self-limiting (capped), must be queryable relatively quickly

Problem space: must handle call records for a major cellphone company (around 300k inserts/sec) and keep those call records queryable for as long as possible (a week, for instance)

#!/usr/bin/perl

use strict;
use warnings;
use threads;
use threads::shared;

use MongoDB;
use Tie::IxHash;    # ordered hashes for run_command documents
use boolean;        # real boolean values for driver options
use Time::HiRes;

my $conn = MongoDB::Connection->new;

my $db = $conn->tutorial;

my $users = $db->users;

# A capped collection needs an explicit size in bytes ("max" alone is not
# enough), and the capped flag must be a real boolean, not the string
# "boolean::true".
my $cmd = Tie::IxHash->new(
    "create"    => "users",
    "capped"    => boolean::true,
    "size"      => 1024 * 1024 * 1024,  # 1 GB cap
    "max"       => 10000000,
    );

$db->run_command($cmd);

# ensure_index takes the keys to index on; "background" is an option,
# not a key, so it goes in the second argument.
my $idx = Tie::IxHash->new(
    "name"  => 1,
);
$users->ensure_index($idx, { background => boolean::true });


my $myhash =
    {
        "name"  => "James",
        "age"   => 31,
        #    "likes" => [qw/Danielle biking food games/]
    };

my $j : shared = 0;

my $numthread = 2;  # how many threads to run

my @array;
for (1..100000) {
    push (@array, $myhash);   # note: every element references the same hash
    $j++;
}

sub thInsert {
    #my @ids = $users->batch_insert(\@array);
    #$users->bulk_insert(\@array);
    $users->batch_insert(\@array);
}

my @threads;

my $timestart = Time::HiRes::time();
push @threads, threads->new(\&thInsert) for 1..$numthread;
$_->join foreach @threads; # wait for all threads to finish
print (($j*$numthread) . "\n");
my $timeend = Time::HiRes::time();

print( (($j*$numthread)/($timeend - $timestart)) . "\n");

$users->drop();
$db->drop();

Answers


Writes to MongoDB currently acquire a global write lock, although collection-level locking is hopefully coming soon. By using more threads you're likely introducing more lock contention, as the threads block each other while they wait for the lock to be released.

Indexes will also slow you down. To get the best insert performance it's ideal to add them after you've loaded your data; however, this isn't always possible, for example when you're using a unique index.
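In the mongo shell of that era, for instance, the index can be built once the bulk load has finished (the field name is illustrative):

```javascript
// Build the index after the bulk load, in the background,
// so the inserts themselves run unindexed at full speed.
db.users.ensureIndex({ name: 1 }, { background: true });
```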

To really maximise write performance, your best bet is sharding. This'll give you much better concurrency and higher disk I/O capacity as you distribute writes across several machines.
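As a rough sketch, setting that up from the mongo shell against a mongos router might look like this (the database, collection, and shard key are assumptions carried over from the question's script):

```javascript
// Distribute tutorial.users across the shards, keyed on "name".
sh.enableSharding("tutorial");
sh.shardCollection("tutorial.users", { name: 1 });
```

Note the caveats elsewhere in this thread: a capped collection cannot be sharded, and a monotonically increasing key such as a timestamp would send all writes to one shard.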


2 threads currently do more per sec than 16 threads on a 16 core dual processor machine.

MongoDB inserts cannot be done concurrently. Every insert needs to acquire a write lock. I'm not sure whether that is a global or a per-collection lock, but in your case it would not make a difference.

So making this program multi-threaded does not make much sense as soon as Mongo becomes the bottleneck.

Do I need to use sharding?

You cannot shard a capped collection.


I've noticed that building the index after inserting helps.


Hmm.. you won't get that much performance out of one MongoDB server.

0.3M * 60 * 60 * 24 = ~26G records/day, ~180G records/week. I guess your record size is around 100 bytes, so that's about 2.6TB of data/day. I don't know which field(s) you use for indexing, but I doubt an entry is below 10-20 bytes, so just the daily index is going to run to hundreds of gigabytes, not to mention the whole week.. the index won't fit into memory, and with a lot of queries that's a good recipe for disaster.
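The arithmetic above is easy to check (pure arithmetic, no MongoDB required; the 100-byte record and 16-byte index entry are this answer's assumptions, not measured values):

```python
# Back-of-the-envelope sizing for 300k inserts/sec.
inserts_per_sec = 300_000
records_per_day = inserts_per_sec * 60 * 60 * 24
print(records_per_day)  # → 25920000000, i.e. ~26G records/day

record_bytes = 100        # assumed average record size
index_entry_bytes = 16    # assumed per-entry index cost (key + overhead)

data_per_day_tb = records_per_day * record_bytes / 1e12       # ~2.6 TB/day
index_per_day_gb = records_per_day * index_entry_bytes / 1e9  # ~415 GB/day
```

Even at a conservative 16 bytes per index entry, a single day's index is far larger than any realistic amount of RAM.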

You should do manual sharding, partitioning the data based on the search field(s). Since it's a major telecom company, you should also do replication. Buy a lot of single/dual-core machines; you only need many cores on the main (Perl?) server.
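A minimal sketch of that partitioning idea, assuming the search field is the caller's phone number and the shard count is 8 (both illustrative, not from the question), shown in Python rather than the question's Perl:

```python
import hashlib

def shard_for(phone_number: str, num_shards: int) -> int:
    """Pick a shard by hashing the search field.

    Hashing spreads writes evenly across machines; partitioning on a
    timestamp would funnel every insert to the same shard."""
    digest = hashlib.md5(phone_number.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# Each shard index would map to its own mongod (plus replicas) in practice.
shard = shard_for("+15551234567", 8)
```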

BTW how do you query the data? Could you use a key-value store?


Why don't you manually cap the collection? You could shard across multiple machines, apply the indexes you need for the queries, and then delete the unwanted documents every hour or so.
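A sketch of that hourly cleanup, assuming each document carries a `ts` (epoch seconds) field — shown over a plain list in Python; against MongoDB it would be a remove() with a `$lt` filter on `ts`:

```python
import time

def prune_old(records, max_age_seconds, now=None):
    """Drop records older than the cutoff -- a manual 'cap' by age."""
    now = time.time() if now is None else now
    cutoff = now - max_age_seconds
    return [r for r in records if r["ts"] >= cutoff]

week = 7 * 24 * 3600
records = [{"ts": 0}, {"ts": 1_000_000}]
print(prune_old(records, week, now=1_000_000))  # → [{'ts': 1000000}]
```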

The bottleneck you have is most likely the global write lock - I have seen this happen in my evaluation of MongoDB for an insert-heavy time-series application. You need to make sure the shard key is not the timestamp, otherwise all the inserts will execute sequentially on the same machine instead of being distributed across multiple machines.


The write lock on MongoDB is global, but quoting this: "collection-level locking coming soon".

Do I need to use sharding?

Not so easy to answer. If what you can get out of one mongod does not meet your requirements, you kind of have to, since sharding is the only way to scale writes on MongoDB (writes on different instances will not block each other).

