Tuesday, April 05, 2011

 

11 Buggy Disappointments in MongoDB

I've been exploring MongoDB lately, primarily for its map-reduce functionality, and I've found a few shortcomings that I'm not crazy about.
  1. If you dump a document (i.e. export it to BSON format) containing an array with undefined elements, the undefined elements are lost when you import that data back, shifting the indexes of the remaining values. For example, exporting a document containing a member that was created like
    var a = [];
    a[0] = "a";
    a[2] = "c";

    when imported will result in the array ["a", "c"], with "c" moved from index 2 to index 1.
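The index shift can be simulated in plain JavaScript (this is only an illustration; the actual loss happens in the export/import tooling, not in the language):

```javascript
// Simulate an export/import round trip that silently drops the undefined
// slots of a sparse array.
var a = [];
a[0] = "a";
a[2] = "c"; // a[1] is a hole / undefined

// Array.prototype.filter skips holes, so this mimics the lossy round trip:
var roundTripped = a.filter(function (x) { return x !== undefined; });

console.log(a.length);        // 3
console.log(roundTripped);    // [ "a", "c" ]: "c" shifted from index 2 to 1
console.log(roundTripped[2]); // undefined: lookups by original index now fail
```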
  2. If you export data from a sharded collection, only data from one of the shards is actually output.
  3. I've had strange, inconsistent problems where mapReduce would fail with no reasonable explanation after importing a large amount (~10GB) of data into a sharded collection.  Bouncing the cluster's mongods has fixed the problems.
  4. Mongo's db.eval() is much more dangerous than the documentation lets on. If you do something moderately dumb in there that uses a lot of memory, you can cause the server to run out of memory, which causes it to segfault and leaves the data store in a state that requires recovery.
  5. The JavaScript map and reduce functions passed to mapReduce don't pull in variables or functions from their enclosing scope, as one would normally expect of JavaScript. The Mongo solution is the scope argument passed to mapReduce. Unfortunately, scope can't be used to pass functions. As far as I can tell, the Mongo approach is to add helper functions to the db.system.js collection, which is a pretty poor solution because it hinders code maintenance.
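A minimal sketch (plain JavaScript, not the mongo shell) of why the closure is lost: the map function travels to the server only as source text, so anything it captured from its enclosing scope does not survive the trip.

```javascript
// Simulate shipping a map function to the server: only the function's source
// text is sent, so the closed-over variable `multiplier` is gone when the
// function is re-created on the other side.
function demo() {
  var multiplier = 10;
  var map = function (x) { return x * multiplier; };

  // Works locally, thanks to the closure:
  var local = map(2); // 20

  // Indirect eval re-creates the function from source text in the global
  // scope, roughly what the server does with the shipped function:
  var shipped = (0, eval)("(" + map.toString() + ")");
  var remote;
  try {
    remote = shipped(2);
  } catch (e) {
    remote = e.name; // "ReferenceError": multiplier is not defined there
  }
  return { local: local, remote: remote };
}

console.log(demo()); // { local: 20, remote: 'ReferenceError' }
```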
  6. For reasons I can't explain, I've had mapReduce jobs fail a couple of times after running for hours and going through both the map and reduce phases entirely, then reporting that a function defined in db.system.js was undefined. If it was undefined, it should have been reported as undefined before it was called 50 million times. The same mapReduce jobs ran successfully on smaller samples of data in unsharded collections.
  7. Mongo's mapReduce is single-threaded. Shudder. No, really. No matter how many mapReduce jobs you throw at it at once, CPU utilization will hover around 100% of one core. If you want Mongo to use more than one core for mapReduce, you need multiple shards on the same box, but in practice that's a rather ineffective approach; Mongo doesn't always shard when I want it to shard, even if I set the chunk size stupidly low.
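For reference, a mongo shell sketch of forcing a small chunk size (a config fragment, run against a mongos; the value is in MB and 1 MB is far below the default, hence "stupidly low"):

```javascript
// Run against a mongos. The cluster-wide chunk size lives in the config
// database's settings collection, expressed in MB. Setting it very low
// encourages splitting, but (as noted above) does not guarantee the balancer
// will actually distribute the chunks.
db.getSiblingDB("config").settings.save({ _id: "chunksize", value: 1 });
```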
  8. If you do shard to try to get multiple cores running mapReduce, you may find that the primary eats all the memory, starving the secondary shard servers.
  9. The VM I was playing with Mongo in had only a 100 GB partition. I ran quite a few different mapReduce jobs that output into different collections, eating up lots of disk space. One of the mapReduce jobs hung. I realized that the partition had less than 2 GB of free space left, so Mongo couldn't allocate another data file. No message was returned to the client explaining this, unfortunately.
  10. I went around removing all the documents from collections and then dropping them, trying to free disk space. For whatever reason, Mongo didn't actually free any disk space after I did that. I tried bouncing the cluster to see if that would allow the servers to reclaim the freed space to no avail. Although frustrating, that makes sense given that the Mongo docs say, "This [deleted] space is reused but never freed to the operating system."
  11. The mongo docs say if you want it to return the freed space to the operating system, you should repair the DB. So I tried
    db.repairDatabase()
    Unfortunately, if your disk is full-ish, Mongo won't have space to repair. When I tried, I got the following error.
    Cannot repair database X having size: 21415067648 (bytes) because free disk space is: 887955456 (bytes)
There's still plenty to like in Mongo, but at this point I feel like Mongo's mapReduce functionality is better suited to running queries too big to fit in memory than to serious data crunching. Perhaps my difficulties have been due to getting sharding involved with mapReduce. It's also possible I've made a crucial mistake in configuring sharding, but I think I followed the directions pretty closely.



Comments:
Have you entered issues on jira.mongodb.org for these, especially 1-3?
 
As for item 11 (which has burned me twice already), Mongo 1.9 will include a compact function that will reclaim that space without needing extra free space on the drive.
 
Over the last few days I've been seeing quite a lot of MongoDB articles on DZone, mostly favorable; this is something that reveals the other side of MongoDB. Thanks for this.

Javin
 
#1: I just committed a fix for this (supporting undefined in export/import), so it should be better in the future.

#2: I've reproduced this and someone else is working on it (https://jira.mongodb.org/browse/SERVER-3086).

#3: Please report any information you can give us about 3 at https://jira.mongodb.org/!

#6: this is a known bug, unfortunately: https://jira.mongodb.org/browse/SERVER-925

#8: I don't understand what you mean. Are you trying to run a bunch of shards on one machine?

#10 and #11: simple compaction has already been implemented in the 1.9 branch: you could start up a server with 1.9, compact it, then start it up with 1.8 again.

Taking your middle points together: multithreading MapReduce is something we really want to do (see https://jira.mongodb.org/browse/SERVER-463... one of our most-voted-for bugs) but it's very tricky. Our first step is moving to V8, which should happen for 2.0. After that we'll be working on getting MapReduce working across multiple cores.

We're actually also working on a much simpler aggregation system (as MapReduce is overkill for a lot of people) that should make most aggregations possible to do without JavaScript. You'll be able to create a pipeline of filters, projections, groupings, tees to output collections, etc (this is for 2.0).
 
#7: 10gen is switching the JavaScript engine from SpiderMonkey to V8, which should help with the single-threaded MapReduce issue. The way I understand it, MapReduce is single-threaded because of SpiderMonkey's dependency on global variables, which makes running multiple instances impossible.
 
Regarding #4, I agree it's too easy to crash mongod in strange ways with badly behaved JS.

We've raised https://jira.mongodb.org/browse/SERVER-3131 and are watching https://jira.mongodb.org/browse/SERVER-3012, although this is with $where clause rather than db.eval.
 
Claim #11 is nonsense. If you reorganize stuff on any database system you need additional space. What is your point here?
 
5. ... Unfortunately, scope can't be used to pass functions

This is incorrect. Instead of passing a function, put the function in an object and pass the object:

var scopeAdapter = {
    emitter: function (doc) {
        this.a = doc.PropA;
        this.b = doc.PropB;
    }
};

var map = function () {
    var obj = new emitter(this);
    var key = ...;
    var value = ...;
    emit(key, value);
};

db.runCommand({
    mapreduce: ...,
    scope: scopeAdapter,
    map: map,
    reduce: ...,
    out: ...
});
 