for any info/changes follow me: @nickmilon

mongoUtils.mapreduce module

map reduce operations

Note

optimize by sorting on the emit field (see notes),
but also see my ticket (probably solved after v3)
when mr output is {‘inline’: 1} the maximum output size is 16MB (max BSON document size)
when map reduce runs on a replica set secondary, the only output option is ‘inline’
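
The two output constraints in the notes above can be folded into a small helper. This is only a sketch: `choose_out` is a hypothetical function that is not part of mongoUtils, and the estimated result size is assumed to be supplied by the caller.

```python
# Hypothetical helper, NOT part of mongoUtils: picks an `out` specification
# that respects the two constraints described in the notes above.

MAX_BSON_DOC_SIZE = 16 * 1024 * 1024  # 16MB, the maximum BSON document size


def choose_out(on_secondary, estimated_bytes, collection_name="mr_tmp"):
    """Return an `out` dictionary that could be passed to mr()."""
    if on_secondary:
        # On a replica secondary the only available output option is inline,
        # and inline results must fit in a single BSON document.
        if estimated_bytes > MAX_BSON_DOC_SIZE:
            raise ValueError("inline output would exceed 16MB")
        return {"inline": 1}
    # Otherwise write to a collection, which has no 16MB limit on the total.
    return {"replace": collection_name}
```

For instance, `choose_out(True, 1024)` gives `{"inline": 1}`, while `choose_out(False, 10**9, "mr_big")` gives `{"replace": "mr_big"}`.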
mongoUtils.mapreduce.mr(coll, fun_map, fun_reduce=None, query={}, out={'replace': 'mr_tmp'}, fun_finalize=None, scope={}, sort=None, jsMode=False, verbose=1)[source]

simplified generic Map Reduce; see MongoDB Map Reduce

Parameters:
  • coll (object) a pymongo collection instance
  • fun_map js function used for map
  • fun_reduce js function used for reduce; defaults to a function that increments value count
  • query a pymongo query dictionary to query collection; defaults to {}
  • out a dictionary for output specification, {replace|merge|reduce: collection_name, db: db_name}; can also be {‘inline’: 1} for in-memory operation (with some limitations); defaults to {“replace”: ‘mr_tmp’}
  • scope vars available during map-reduce-finalize
  • sort dictionary to sort before map, e.g. sort={“_id”: 1}
  • jsMode True|False (don’t convert to BSON between map & reduce if True); should be False if we expect more than 500K distinct results
  • ‘db’ (optional): database_name; if no db is specified, the output collection will be in the same db as the input coll
Returns:

tuple (results collection, or results list if out={“inline”: 1}; MR response statistics)

Example:

see group_counts() function
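
As an illustration of what fun_map and fun_reduce might look like, the sketch below builds JavaScript source for a simple counting job. `make_count_functions` is a hypothetical helper, the JavaScript is an assumption rather than a copy of what mongoUtils uses internally, and passing plain source strings to mr() is likewise assumed.

```python
# Hypothetical helper, NOT part of mongoUtils: builds JavaScript source
# strings for a counting map-reduce on one field.

def make_count_functions(field_name):
    """Return (fun_map, fun_reduce) as JavaScript source strings."""
    # map: emit the field's value as the key, with a count of 1
    fun_map = "function () { emit(this.%s, 1); }" % field_name
    # reduce: sum all counts emitted for the same key
    fun_reduce = "function (key, values) { return Array.sum(values); }"
    return fun_map, fun_reduce


fun_map, fun_reduce = make_count_functions("lang")

# With a live pymongo collection `coll` (assumed), these could be used as:
#   col, stats = mr(coll, fun_map, fun_reduce, out={"replace": "mr_tmp"})
```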

mongoUtils.mapreduce.group_counts(collection, field_name='_id', query={}, out={'replace': 'mr_tmp'}, sort=None, jsMode=False, verbose=1)[source]

group values of field using map reduce

Parameters:

see mr() function

Example:
>>> from pymongo import MongoClient;from mongoUtils.configuration import testDbConStr      # import MongoClient
>>> from mongoUtils.mapreduce import group_counts                                          # import group_counts
>>> db = MongoClient(testDbConStr).get_default_database()                                  # get test database
>>> col, res = group_counts(db.muTest_tweets_users, 'lang', out={"replace": "muTest_mr"})  # execute MR
>>> res                                                                                    # check MR statistics
{u'counts': {u'input': 997, u'reduce': 72, u'emit': 997, u'output': 21},
u'timeMillis': 256, u'ok': 1.0, u'result': u'del_1'}
>>> for i in col.find(sort=[('value',-1)]): print i                                        # print MR results
{'_id': 'en', 'value': 352.0}
{'_id': 'ja', 'value': 283.0}
{'_id': 'es', 'value': 100.0}
>>>
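
To make the shape of these results concrete without a database, the grouping can be emulated in plain Python. This is a sketch only: `emulate_group_counts` is a hypothetical stand-in, the sample documents are made up, and only top-level fields are handled.

```python
from collections import defaultdict


def emulate_group_counts(docs, field_name="_id"):
    """Pure-Python stand-in for the map, reduce and sort steps (illustration only)."""
    emitted = defaultdict(list)
    for doc in docs:
        # map step: emit (field value, 1) for every document
        emitted[doc.get(field_name)].append(1.0)
    # reduce step: sum the counts emitted for each key
    results = [{"_id": key, "value": sum(counts)} for key, counts in emitted.items()]
    # same ordering as find(sort=[('value', -1)]) in the example above
    return sorted(results, key=lambda r: r["value"], reverse=True)


sample_users = [{"lang": "en"}, {"lang": "ja"}, {"lang": "en"}]  # made-up data
counts = emulate_group_counts(sample_users, "lang")
# counts == [{'_id': 'en', 'value': 2.0}, {'_id': 'ja', 'value': 1.0}]
```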
mongoUtils.mapreduce.mr2(operation, col_a, col_a_key, col_b, col_b_key, col_a_query=None, col_b_query=None, db=None, out=None, sort_on_key_fields=False, jsMode=False, verbose=3)[source]

A kind of set operation on two collections. Map-reduces two collection objects (col_a, col_b) on a common field (col_a_key, col_b_key), allowing optional queries (col_a_query, col_b_query).

Parameters:
  • operation: ‘Orphans’ or ‘Join’ (see Results_collection below)
  • col_a, col_b: pymongo collection objects
  • col_a_key, col_b_key: names of the fields to run MR on for col_a & col_b
  • col_a_query, col_b_query: optional queries to run on the respective collections
  • db: optional db name to use for results (use ‘local’ to avoid replication of results)
  • out: optional output collection name; defaults to mr_operation
  • sort_on_key_fields: (bool) tries to sort on key if key has an index; this is supposed to speed up MR
  • jsMode: (True or False) (see mongo documentation)
Returns:

a tuple (results_collection, MapReduce1 statistics, MapReduce2 statistics)

Results_collection:
 
  • if operation == ‘Orphans’:
    • {‘_id’: ‘XXXXX’, ‘value’: {‘a’: 2.0, ‘sum’: 3.0, ‘b’: 1.0}}

    • value.a = count of documents in col_a

    • value.b = count of documents in col_b

    • value.sum = total count of documents in col_a and col_b (value.a + value.b)

    • to get documents non existing in col_a:

    • >>> resultCollection.find({'value.a':0})
      
    • to get documents non existing in col_b:

    • >>> resultCollection.find({'value.b':0})
      
    • to get documents existing in both col_a and col_b

    • >>> resultCollection.find({'value.a':{'$gt':0}, 'value.b':{'$gt': 0}})
      
    • to check for unique in both collections

    • >>> resultCollection.find({'value.sum':2})
      
  • if operation == ‘Join’:
    • performs a join between 2 collections

    • {‘_id’: ‘XXXXX’, ‘value’: {‘a’: document from col_a,’b’: document from col_b}}

    • if a document is missing from a collection its corresponding value is None

    • Warning

      document’s value will also be None if key exists but is None, so res[0].find({‘value.b’: None}) means either it didn’t exist in collection b or its value is None
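
The ‘Orphans’ result shape and its queries can be mimicked in plain Python. The following is a sketch under stated assumptions: `emulate_orphans` is a hypothetical function, the key lists are made up, and the list comprehensions stand in for the find() queries listed above.

```python
from collections import Counter


def emulate_orphans(keys_a, keys_b):
    """Count each key in a and b, mirroring the 'Orphans' value layout (illustration only)."""
    count_a, count_b = Counter(keys_a), Counter(keys_b)
    results = {}
    for key in set(count_a) | set(count_b):
        # a Counter returns 0 for keys it has never seen
        a, b = float(count_a[key]), float(count_b[key])
        results[key] = {"a": a, "b": b, "sum": a + b}
    return results


res = emulate_orphans(["x", "x", "y"], ["x", "z"])                 # made-up keys
missing_in_a = sorted(k for k, v in res.items() if v["a"] == 0)    # like find({'value.a': 0})
missing_in_b = sorted(k for k, v in res.items() if v["b"] == 0)    # like find({'value.b': 0})
in_both = sorted(k for k, v in res.items() if v["a"] > 0 and v["b"] > 0)
```

Here key "x" ends up with a=2.0, b=1.0 and sum=3.0, the same layout as the sample result document above.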

Example:
>>> from pymongo import MongoClient;from mongoUtils.configuration import testDbConStr      # import MongoClient
>>> from mongoUtils.mapreduce import mr2                                                   # import mr2
>>> db = MongoClient(testDbConStr).get_default_database()                                  # get test database
>>> res , stats1, stats2 = mr2('Orphans', db.muTest_tweets, 'user.screen_name',            # execute
        col_b = db.muTest_tweets_users, col_b_key='screen_name',
        col_b_query= {'screen_name': {'$ne':'Albert000G'}}, verbose=0)
>>> res.find({'value.b':0}).count()                                                        # not found in b
1
>>> res.find({'value.b':0})[0]
{u'_id': u'Albert000G', u'value': {u'a': 1.0, u'sum': 1.0, u'b': 0.0}}
>>> res = mr2('Join', db.muTest_tweets, 'user.screen_name',                                # execute a Join
          col_b = db.muTest_tweets_users, col_b_key='screen_name', col_a_query= {},
          col_b_query= {'screen_name':{'$ne':'Albert000G'}},   verbose=3)
>>> f = res[0].find({'value.b': None})                                                     # check missing in b
>>> for i in f: print i['value']['a']['user']['screen_name']
Albert000G
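
The ‘Join’ result layout can also be sketched without a database. `emulate_join` below is a hypothetical stand-in, the documents are made up, and when a key occurs more than once on one side this simplified version just keeps the last document seen.

```python
def emulate_join(docs_a, key_a, docs_b, key_b):
    """Pair documents from two lists on a common key; a missing side stays None."""
    joined = {}
    for doc in docs_a:
        joined.setdefault(doc[key_a], {"a": None, "b": None})["a"] = doc
    for doc in docs_b:
        joined.setdefault(doc[key_b], {"a": None, "b": None})["b"] = doc
    return joined


# made-up sample documents
tweets = [{"screen_name": "Albert000G", "text": "hi"},
          {"screen_name": "bob", "text": "yo"}]
users = [{"screen_name": "bob"}]

res = emulate_join(tweets, "screen_name", users, "screen_name")
# keys with no partner in b, similar in spirit to find({'value.b': None})
missing_in_b = [k for k, v in res.items() if v["b"] is None]
```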