for any info/changes follow me: @nickmilon

mongoUtils.mapreduce module

map reduce operations

Note

see the MongoDB manual
optimize by sorting on the emit field (see notes, but also see my ticket, probably solved after v3)
when MR output is {'inline': 1} the maximum output size is 16MB (the maximum BSON document size)
when map reduce runs on a replica secondary the only output option is 'inline'

mongoUtils.mapreduce.mr(coll, fun_map, fun_reduce=None, query={}, out={'replace': 'mr_tmp'}, fun_finalize=None, scope={}, sort=None, jsMode=False, verbose=1)

simplified generic map reduce, see MongoDB Map Reduce

Parameters:
	coll: (object) a pymongo collection instance
	fun_map: js function used for map
	fun_reduce: js function used for reduce, defaults to a function that increments value count
	query: a pymongo query dictionary to query collection, defaults to {}
	out: a dictionary for output specification {replace|merge|reduce: collection_name, db: db_name}; can also specify {'inline': 1} for in-memory operation (with some limitations); defaults to {'replace': 'mr_tmp'}
	scope: vars available during map-reduce-finalize
	sort: dictionary to sort before map, i.e. sort={'_id': 1}
	jsMode: True|False (don't convert to BSON between map & reduce if True); should be False if we expect more than 500K distinct results
	db: (optional) database_name; if no db is specified the output collection will be in the same db as the input coll
Returns:	tuple (results collection, or results list if out={'inline': 1}; MR response statistics)
Example:	see `group_counts()` function
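The out specification described above can be sketched as a small builder. This is only an illustration: make_out is a hypothetical helper that is not part of mongoUtils, and whether mr() accepts an OrderedDict in place of a plain dict is an assumption. The action key is placed first because MongoDB reads the out document in order.

```python
from collections import OrderedDict


def make_out(action="replace", collection="mr_tmp", db=None, inline=False):
    """Build an output specification (hypothetical helper, not part of mongoUtils)."""
    if inline:
        # In-memory results; limited by the maximum BSON document size.
        return {"inline": 1}
    if action not in ("replace", "merge", "reduce"):
        raise ValueError("action must be 'replace', 'merge' or 'reduce'")
    out = OrderedDict([(action, collection)])  # action key goes first
    if db is not None:
        out["db"] = db  # output to another database, e.g. 'local'
    return out


default_out = make_out()
local_out = make_out("merge", "totals", db="local")
inline_out = make_out(inline=True)
print(dict(default_out), dict(local_out), inline_out)
```

The resulting value would then be passed as the out argument, for example mr(coll, fun_map, out=make_out("merge", "totals")).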

mongoUtils.mapreduce.group_counts(collection, field_name='_id', query={}, out={'replace': 'mr_tmp'}, sort=None, jsMode=False, verbose=1)

group values of field using map reduce

Parameters:

see mr() function

Example:

>>> from pymongo import MongoClient;from mongoUtils.configuration import testDbConStr      # import MongoClient
>>> db = MongoClient(testDbConStr).get_default_database()                                  # get test database
>>> col, res = group_counts(db.muTest_tweets_users, 'lang', out={"replace": "muTest_mr"})  # execute MR
>>> res                                                                                    # check MR statistics
{u'counts': {u'input': 997, u'reduce': 72, u'emit': 997, u'output': 21},
u'timeMillis': 256, u'ok': 1.0, u'result': u'del_1'}
>>> for i in col.find(sort=[('value',-1)]): print(i)                                       # print MR results
{'_id': 'en', 'value': 352.0}
{'_id': 'ja', 'value': 283.0}
{'_id': 'es', 'value': 100.0}
>>>
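To make the shape of the results above easier to follow, here is a pure-Python emulation of the same grouping, run on an in-memory list instead of a collection. It is a sketch of the idea (emit 1 per document, then sum per key), not the module's implementation; emulate_group_counts and the sample documents are made up for illustration, and only top-level field names are handled.

```python
from collections import defaultdict


def emulate_group_counts(docs, field_name="_id"):
    """Emulate a count-per-value map reduce on a list of dicts (illustration only)."""
    emitted = defaultdict(list)
    for doc in docs:
        # map phase: emit (field value, 1) for every document
        emitted[doc.get(field_name)].append(1)
    # reduce phase: sum the emitted values for each key
    results = [{"_id": key, "value": float(sum(values))}
               for key, values in emitted.items()]
    # mimic col.find(sort=[('value', -1)])
    results.sort(key=lambda row: row["value"], reverse=True)
    return results


docs = [{"lang": "en"}, {"lang": "ja"}, {"lang": "en"},
        {"lang": "es"}, {"lang": "en"}, {"lang": "ja"}]
counts = emulate_group_counts(docs, "lang")
for row in counts:
    print(row)
# {'_id': 'en', 'value': 3.0}
# {'_id': 'ja', 'value': 2.0}
# {'_id': 'es', 'value': 1.0}
```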

mongoUtils.mapreduce.mr2(operation, col_a, col_a_key, col_b, col_b_key, col_a_query=None, col_b_query=None, db=None, out=None, sort_on_key_fields=False, jsMode=False, verbose=3)

A kind of set operation on two collections: map reduces two collection objects (col_a, col_b) on a common field (col_a_key, col_b_key), allowing optional queries (col_a_query, col_b_query)

Parameters:
	col_a, col_b: pymongo collection objects
	col_a_key, col_b_key: names of the fields to run MR on for col_a & col_b
	col_a_query, col_b_query: optional queries to run on the respective collections
	db: optional db name to use for results (use 'local' to avoid replication of results)
	out: optional output collection name, defaults to mr_operation
	sort_on_key_fields: (bool) tries to sort on key if the key has an index; this is supposed to speed up MR
	jsMode: True or False (see MongoDB documentation)
Returns:	a tuple (results_collection collection, MapReduce1 statistics, MapReduce2 statistics)
Results_collection:
	if operation == 'Orphans':
		{'_id': 'XXXXX', 'value': {'a': 2.0, 'sum': 3.0, 'b': 1.0}}
		value.a = count of documents in a, value.b = count of documents in b, value.sum = count of documents in both (a + b)
		to get documents non existing in col_a:
		>>> resultCollection.find({'value.a': 0})
		to get documents non existing in col_b:
		>>> resultCollection.find({'value.b': 0})
		to get documents existing in both col_a and col_b:
		>>> resultCollection.find({'value.a': {'$gt': 0}, 'value.b': {'$gt': 0}})
		to check for unique in both collections:
		>>> resultCollection.find({'value.sum': 2})
	if operation == 'Join':
		performs a join between 2 collections
		{'_id': 'XXXXX', 'value': {'a': document from col_a, 'b': document from col_b}}
		if a document is missing from a collection its corresponding value is None
		Warning: a document's value will also be None if the key exists but is None, so res[0].find({'value.b': None}) means either it didn't exist in collection b or its value is None
Example:
>>> from pymongo import MongoClient;from mongoUtils.configuration import testDbConStr  # import MongoClient
>>> db = MongoClient(testDbConStr).get_default_database()                              # get test database
>>> res, stats1, stats2 = mr2('Orphans', db.muTest_tweets, 'user.screen_name',         # execute
...                           col_b=db.muTest_tweets_users, col_b_key='screen_name',
...                           col_b_query={'screen_name': {'$ne': 'Albert000G'}}, verbose=0)
>>> res.find({'value.b': 0}).count()                                                   # not found in b
1
>>> res.find({'value.b': 0})[0]
{u'_id': u'Albert000G', u'value': {u'a': 1.0, u'sum': 1.0, u'b': 0.0}}
>>> res = mr2('Join', db.muTest_tweets, 'user.screen_name',                            # execute a Join
...           col_b=db.muTest_tweets_users, col_b_key='screen_name',
...           col_a_query={}, col_b_query={'screen_name': {'$ne': 'Albert000G'}}, verbose=3)
>>> f = res[0].find({'value.b': None})                                                 # check missing in b
>>> for i in f: print(i['value']['a']['user']['screen_name'])
Albert000G
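The two result shapes can be emulated in plain Python on in-memory lists, which may help when deciding which operation to run. This is a sketch under stated assumptions and not the module's implementation: emulate_orphans, emulate_join and the sample data are hypothetical, keys are flat field names rather than dotted paths, and for 'Join' the last document seen for a key wins, which may differ from what mr2 does with duplicate keys.

```python
def emulate_orphans(docs_a, key_a, docs_b, key_b):
    """Count, per key value, how many documents each side contributes."""
    counts = {}
    for side, docs, key in (("a", docs_a, key_a), ("b", docs_b, key_b)):
        for doc in docs:
            entry = counts.setdefault(doc[key], {"a": 0.0, "b": 0.0, "sum": 0.0})
            entry[side] += 1
            entry["sum"] += 1
    return [{"_id": k, "value": v}
            for k, v in sorted(counts.items(), key=lambda kv: kv[0])]


def emulate_join(docs_a, key_a, docs_b, key_b):
    """Pair documents from both sides by key; a missing side stays None."""
    joined = {}
    for side, docs, key in (("a", docs_a, key_a), ("b", docs_b, key_b)):
        for doc in docs:
            entry = joined.setdefault(doc[key], {"a": None, "b": None})
            entry[side] = doc  # assumption: last document per key wins
    return [{"_id": k, "value": v}
            for k, v in sorted(joined.items(), key=lambda kv: kv[0])]


# made-up sample data standing in for two collections
tweets = [{"screen_name": "Albert000G", "text": "hello"},
          {"screen_name": "bob", "text": "hi"}]
users = [{"screen_name": "bob", "lang": "en"}]

orphans = emulate_orphans(tweets, "screen_name", users, "screen_name")
missing_in_b = [row["_id"] for row in orphans if row["value"]["b"] == 0]
print(missing_in_b)  # ['Albert000G']

joined = emulate_join(tweets, "screen_name", users, "screen_name")
unmatched = [row["_id"] for row in joined if row["value"]["b"] is None]
print(unmatched)  # ['Albert000G']
```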
