for any info/changes follow me: @nickmilon
mongoUtils.mapreduce module
map reduce operations
- mongoUtils.mapreduce.mr(coll, fun_map, fun_reduce=None, query={}, out={'replace': 'mr_tmp'}, fun_finalize=None, scope={}, sort=None, jsMode=False, verbose=1)
Simplified generic Map Reduce; see the MongoDB Map Reduce documentation.
Parameters:
- coll (object): a pymongo collection instance
- fun_map: JavaScript function used for the map step
- fun_reduce: JavaScript function used for the reduce step; defaults to a function that increments a value count
- query: a pymongo query dictionary used to query the collection; defaults to {}
- out: a dictionary with the output specification, {replace|merge|reduce: collection_name, db: db_name}; {'inline': 1} can also be specified for an in-memory operation (with some limitations); defaults to {'replace': 'mr_tmp'}
- scope: variables available during map, reduce and finalize
- sort: dictionary to sort by before the map step, e.g. sort={'_id': 1}
- jsMode: True|False; if True, values are not converted to BSON between map and reduce; should be False if more than 500K distinct results are expected
- 'db' (optional): database_name; if no db is specified, the output collection will be in the same db as the input coll
Returns: tuple (results collection, or results list if out={'inline': 1}; MR response statistics)
Example: see the group_counts() function
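The parameters above can be put together roughly as follows. This is a minimal sketch, not taken from the mongoUtils sources: the JavaScript bodies, the `lang` field, and the helpers `out_spec` and `run_lang_counts` are hypothetical, and it assumes that `mr()` accepts the JavaScript functions as plain strings (if it does not, they may need to be wrapped, for instance in `bson.code.Code`).

```python
# Hypothetical JavaScript bodies; the field name 'lang' is only an illustration.
FUN_MAP = "function () { emit(this.lang, 1); }"
FUN_REDUCE = "function (key, values) { return Array.sum(values); }"


def out_spec(action, collection_name, db_name=None):
    """Build an output specification of the shape described for `out`.

    Hypothetical helper, not part of mongoUtils.
    """
    if action not in ("replace", "merge", "reduce"):
        raise ValueError("action must be 'replace', 'merge' or 'reduce'")
    spec = {action: collection_name}  # the action key is inserted first
    if db_name is not None:
        spec["db"] = db_name          # optional target database
    return spec


def run_lang_counts(coll):
    """Call mr() with the pieces above; requires a live pymongo collection."""
    # Imported lazily so the sketch can be loaded without the package installed.
    from mongoUtils.mapreduce import mr
    return mr(coll, FUN_MAP, fun_reduce=FUN_REDUCE, query={},
              out=out_spec("replace", "mr_tmp"))
```

According to the Returns entry above, `run_lang_counts(coll)` would hand back the results collection together with the MR statistics.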
- mongoUtils.mapreduce.group_counts(collection, field_name='_id', query={}, out={'replace': 'mr_tmp'}, sort=None, jsMode=False, verbose=1)
Group the values of a field using map reduce.
Parameters: see the mr() function
Example:
>>> from pymongo import MongoClient; from mongoUtils.configuration import testDbConStr  # import MongoClient
>>> db = MongoClient(testDbConStr).get_default_database()  # get test database
>>> col, res = group_counts(db.muTest_tweets_users, 'lang', out={"replace": "muTest_mr"})  # execute MR
>>> res  # check MR statistics
{u'counts': {u'input': 997, u'reduce': 72, u'emit': 997, u'output': 21}, u'timeMillis': 256, u'ok': 1.0, u'result': u'del_1'}
>>> for i in col.find(sort=[('value',-1)]): print i  # print MR results
{'_id': 'en', 'value': 352.0}
{'_id': 'ja', 'value': 283.0}
{'_id': 'es', 'value': 100.0}
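As a rough mental model of what this grouping computes, the sketch below reproduces the map and reduce steps in plain Python over a list of dictionaries. It is only an illustration under stated assumptions: `group_counts_model` and the sample documents are hypothetical, dotted field names are not handled, and nothing here talks to MongoDB.

```python
from collections import defaultdict


def group_counts_model(docs, field_name="_id"):
    """In-memory model of grouping by a field and counting occurrences."""
    emitted = defaultdict(list)
    for doc in docs:
        # map step: emit (value of the field, 1) for every document
        emitted[doc.get(field_name)].append(1)
    # reduce step: sum the emitted ones per key (floats, as in the MR output)
    return [{"_id": key, "value": float(sum(ones))}
            for key, ones in emitted.items()]


# Hypothetical sample documents
users = [{"lang": "en"}, {"lang": "ja"}, {"lang": "en"}]
result = sorted(group_counts_model(users, "lang"),
                key=lambda row: -row["value"])
for row in result:
    print(row)
```

The rows have the same `{'_id': ..., 'value': ...}` shape as the documents printed in the example above.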
- mongoUtils.mapreduce.mr2(operation, col_a, col_a_key, col_b, col_b_key, col_a_query=None, col_b_query=None, db=None, out=None, sort_on_key_fields=False, jsMode=False, verbose=3)
A kind of set operation on two collections: map reduces two collection objects (col_a, col_b) on a common field (col_a_key, col_b_key), allowing queries (col_a_query, col_b_query).
Parameters:
- col_a, col_b: pymongo collection objects
- col_a_key, col_b_key: names of the fields to run MR on for col_a and col_b
- col_a_query, col_b_query: optional queries to run on the respective collections
- db: optional db name to use for the results (use 'local' to avoid replication of the results)
- out: optional output collection name; defaults to mr_operation
- sort_on_key_fields (bool): tries to sort on the key if the key has an index; this is supposed to speed up MR
- jsMode (True or False): see the MongoDB documentation
Returns: a tuple (results collection, MapReduce1 statistics, MapReduce2 statistics)
Results collection:
- if operation == 'Orphans':
{'_id': 'XXXXX', 'value': {'a': 2.0, 'sum': 3.0, 'b': 1.0}}
value.a = count of documents in col_a
value.b = count of documents in col_b
value.sum = count of documents in col_a plus col_b
to get documents that do not exist in col_a:
>>> resultCollection.find({'value.a': 0})
to get documents that do not exist in col_b:
>>> resultCollection.find({'value.b': 0})
to get documents that exist in both col_a and col_b:
>>> resultCollection.find({'value.a': {'$gt': 0}, 'value.b': {'$gt': 0}})
to check for unique in both collections:
>>> resultCollection.find({'value.sum': 2})
- if operation == 'Join':
performs a join between the 2 collections
{'_id': 'XXXXX', 'value': {'a': document from col_a, 'b': document from col_b}}
if a document is missing from a collection, its corresponding value is None
Warning
a document's value will also be None if the key exists but is None, so res[0].find({'value.b': None}) means the document either did not exist in collection b or its value is None
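The two result shapes described above can be modelled in plain Python. The sketch below is an illustration only, not the library's implementation: `orphans_model`, `join_model` and the sample documents are hypothetical, keys are flat (no dotted paths), and when a key occurs more than once the join model simply keeps the last document seen, which may differ from what mr2 does.

```python
def orphans_model(docs_a, key_a, docs_b, key_b):
    """Count, per key value, how many documents each collection holds."""
    counts = {}
    for side, docs, key in (("a", docs_a, key_a), ("b", docs_b, key_b)):
        for doc in docs:
            entry = counts.setdefault(doc[key], {"a": 0.0, "b": 0.0, "sum": 0.0})
            entry[side] += 1   # per-collection count
            entry["sum"] += 1  # total over both collections
    return [{"_id": k, "value": v} for k, v in counts.items()]


def join_model(docs_a, key_a, docs_b, key_b):
    """Pair documents sharing a key value; a missing side stays None."""
    joined = {}
    for side, docs, key in (("a", docs_a, key_a), ("b", docs_b, key_b)):
        for doc in docs:
            joined.setdefault(doc[key], {"a": None, "b": None})[side] = doc
    return [{"_id": k, "value": v} for k, v in joined.items()]


# Hypothetical sample data: 'bob' has no counterpart in users
tweets = [{"screen_name": "ann"}, {"screen_name": "bob"}]
users = [{"screen_name": "ann"}]

orphans = orphans_model(tweets, "screen_name", users, "screen_name")
missing_in_b = [r["_id"] for r in orphans if r["value"]["b"] == 0]

joined = join_model(tweets, "screen_name", users, "screen_name")
no_b = [r["_id"] for r in joined if r["value"]["b"] is None]
```

In this model a side is None only when no document was found, so the ambiguity described in the warning does not arise here; the counts from the 'Orphans' shape are one way to tell the two cases apart.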
Example:
>>> from pymongo import MongoClient; from mongoUtils.configuration import testDbConStr  # import MongoClient
>>> db = MongoClient(testDbConStr).get_default_database()  # get test database
>>> res , stats1, stats2 = mr2('Orphans', db.muTest_tweets, 'user.screen_name',  # execute
...     col_b = db.muTest_tweets_users, col_b_key='screen_name',
...     col_b_query= {'screen_name': {'$ne':'Albert000G'}}, verbose=0)
>>> res.find({'value.b':0}).count()  # not found in b
1
>>> res.find({'value.b':0})[0]
{u'_id': u'Albert000G', u'value': {u'a': 1.0, u'sum': 1.0, u'b': 0.0}}
>>> res = mr.mr2('Join', db.muTest_tweets, 'user.screen_name',  # execute a Join
...     col_b = db.muTest_tweets_users, col_b_key='screen_name',
...     col_a_query= {}, col_b_query= {'screen_name':{'$ne':'Albert000G'}}, verbose=3)
>>> f = res[0].find({'value.b': None})  # check missing in b
>>> for i in f: print i['value']['a']['user']['screen_name']
Albert000G