for any info/changes follow me: @nickmilon

mongoUtils.schema module

Schema Analyzer Utilities for mongoDB collection based on map reduce

mongoUtils.schema.schema(collection, query={}, out={'replace': 'tmp_mrFields'}, meta=False, scope={'parms': {'levelMax': -1, 'inclHeaderKeys': False}}, verbose=2)[source]

Discovers all field names used by a collection’s documents. For a different approach see here; mongoDB will also introduce a similar tool. Fields of the form xxx.floatApprox, xxx.bottom, xxx.top are internal mongoDB fields for storing long integers. Outputs to the local db so results don’t get replicated.

Parameters:
  • collection: a mongoDB collection

  • query: a pymongo query dictionary that restricts the search to a subset of the collection (useful for large collections)

  • out: map reduce output specifications dictionary (see the mr() function), except that it can’t be inline

  • meta: if True results are passed to schema_meta() function for analysis

  • scope: a dictionary {‘parms’: {‘levelMax’: -1, ‘inclHeaderKeys’: False}}
    • levelMax: (int) max level for keys if -1 any level (defaults to -1)
    • inclHeaderKeys: (bool) if True includes top level keys
  • verbose: (int) if > 0 prints progress and output

Example:
>>> from pymongo import MongoClient                                                        # import MongoClient
>>> from mongoUtils.configuration import testDbConStr                                      # test db connection string
>>> from mongoUtils.schema import schema                                                   # function documented here
>>> db = MongoClient(testDbConStr).get_default_database()                                  # get test database
>>> r = schema(db.muTest_tweets, meta=True, verbose=1)                                     # check fields
........................................................................................................
|                      field                       |      cnt       |percent|depth|       notes        |
........................................................................................................
|                       _id                        |           1,000| 100.00|    1|                    |
|                   contributors                   |           1,000| 100.00|    1|                    |
|                   coordinates                    |           1,000| 100.00|    1|                    |
|             coordinates.coordinates              |              18|   1.80|    2|                    |
|                 coordinates.type                 |              18|   1.80|    2|                    |
|                    created_at                    |           1,000| 100.00|    1|                    |
|                     entities                     |           1,000| 100.00|    1|                    |
|                entities.hashtags                 |           1,000| 100.00|    2|                    |
|                  entities.media                  |             196|  19.60|    2|                    |
|                 entities.symbols                 |           1,000| 100.00|    2|                    |
|                 entities.trends                  |           1,000| 100.00|    2|                    |
|                  entities.urls                   |           1,000| 100.00|    2|                    |
|              entities.user_mentions              |           1,000| 100.00|    2|                    |
|                extended_entities                 |             196|  19.60|    1|                    |
|             extended_entities.media              |             196|  19.60|    2|                    |
|                  favorite_count                  |           1,000| 100.00|    1|                    |
|                    favorited                     |           1,000| 100.00|    1|                    |
|                   filter_level                   |           1,000| 100.00|    1|                    |
|                       geo                        |           1,000| 100.00|    1|                    |
|                 geo.coordinates                  |              18|   1.80|    2|                    |
|                     geo.type                     |              18|   1.80|    2|                    |
|                        id                        |           1,000| 100.00|    1|                    |
|                    id.bottom                     |           1,000| 100.00|    2|-hidden mongo field |
|                  id.floatApprox                  |           1,000| 100.00|    2|-hidden mongo field |
|                  ......etc.......                |           .....| ......|    .|                    |
........................................................................................................
>>> for i in r[1]: print(i)                                                                # print results
{u'_id': u'', u'value': {'notes': '', 'field': u'', u'cnt': 31, u'percent': 3.1000000000000005, u'depth': 1}}
etc. etc...
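
The idea behind field discovery can be illustrated without a database. The sketch below is plain Python over in-memory dictionaries, not the map reduce job that schema() runs; discover_fields and its level_max argument are hypothetical names that only mimic what levelMax is described as doing (-1 meaning any level). Arrays are deliberately not descended into, which is a simplification and may differ from the real function.

```python
def discover_fields(documents, level_max=-1):
    """Count, per dotted field path, how many documents contain it.

    Hypothetical helper for illustration only; it is not part of mongoUtils.
    """
    counts = {}

    def walk(value, prefix, depth):
        # Only sub-documents (dicts) contribute nested field names.
        if not isinstance(value, dict):
            return
        for key, child in value.items():
            path = prefix + "." + key if prefix else key
            counts[path] = counts.get(path, 0) + 1
            # level_max == -1 means "no depth limit".
            if level_max == -1 or depth < level_max:
                walk(child, path, depth + 1)

    for doc in documents:
        walk(doc, "", 1)
    return counts


if __name__ == "__main__":
    sample = [
        {"_id": 1, "geo": {"type": "Point", "coordinates": [1, 2]}},
        {"_id": 2, "geo": None},
    ]
    for field, cnt in sorted(discover_fields(sample).items()):
        print(field, cnt)
```

Counting once per document is what makes a later percentage column meaningful: a field present in 18 of 1,000 documents gets a count of 18 regardless of how large its value is.
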
mongoUtils.schema.schema_meta(mr_keys_results, verbose=2)[source]

Given the results returned by the schema() function, calculates and returns statistics for schema fields; also pretty prints the stats if verbose > 0. Be aware of the hidden fields that mongoDB uses internally.

Parameters:
  • mr_keys_results: results tuples returned by schema()
  • verbose: (int) 0 | 1, if > 0 pretty prints stats
Returns:

list of statistics for each field
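
As a rough illustration of the per-field statistics shown in the table above (cnt, percent, depth), the following sketch derives them from a mapping of field name to document count. It assumes that depth is the number of dot-separated parts and that percent is relative to the total number of documents, which matches the sample output but is not confirmed against the schema_meta() source; field_stats is a hypothetical name.

```python
def field_stats(counts, total_docs):
    """Build one statistics dict per field, sorted by field name.

    Hypothetical helper for illustration; not the mongoUtils implementation.
    """
    stats = []
    for field in sorted(counts):
        cnt = counts[field]
        stats.append({
            "field": field,
            "cnt": cnt,
            # Share of documents that contain the field.
            "percent": 100.0 * cnt / total_docs if total_docs else 0.0,
            # 'geo.type' has two parts, so depth is 2 (assumed definition).
            "depth": field.count(".") + 1,
        })
    return stats


if __name__ == "__main__":
    for row in field_stats({"_id": 1000, "geo.type": 18}, 1000):
        print("{field:<12} {cnt:>6,} {percent:>7.2f} {depth:>3}".format(**row))
```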

mongoUtils.schema.schema_exclude_parents(fields_list, as_string=True)[source]

useful for producing fields parameter for mongoexport

Parameters:
  • fields_list: a fields_list as produced by schema() function
  • as_string: True or False, converts output to string if True (default)
Returns:
  • last level elements of fields_list
Example:
>>> res, stats = sch.schema(db.muTest_tweets_users,verbose=0)
>>> res['value']['fields']
['_id', '_id.str', 'contributors_enabled', 'created_at', 'default_profile' .... ]
>>> schema_exclude_parents(res['value']['fields'])
'_id.str,contributors_enabled,created_at,default_profile,default_profile_image ...
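
The effect shown in the example (dropping '_id' because '_id.str' exists) can be sketched in a few lines. This is a minimal stand-alone version written from the documented behaviour, not the library's code; exclude_parents is a hypothetical name.

```python
def exclude_parents(fields_list, as_string=True):
    """Keep only last level fields, i.e. drop any field that is a parent of another.

    Hypothetical helper for illustration; not the mongoUtils implementation.
    """
    parents = set()
    for field in fields_list:
        parts = field.split(".")
        # Every proper prefix of a dotted path is a parent field.
        for i in range(1, len(parts)):
            parents.add(".".join(parts[:i]))
    leaves = [f for f in fields_list if f not in parents]
    return ",".join(leaves) if as_string else leaves


if __name__ == "__main__":
    fields = ["_id", "_id.str", "contributors_enabled", "created_at"]
    print(exclude_parents(fields))
```

Collecting prefixes of each dotted path, rather than testing with startswith(), avoids treating a field such as 'id' as the parent of an unrelated field such as 'id_str'.
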
mongoUtils.schema.mongoexport_fields(file_path, collection, query={}, excl_fields_lst=[])[source]

exports all field names except excl_fields_lst to a file

Parameters:
  • file_path: (str) path to output file
  • collection: a pymongo collection object
  • query: a pymongo query dictionary (optional) to restrict fields discovery to a subset of a collection (useful for large collections)
  • excl_fields_lst: (list) field names to exclude from output
Example:
>>> mongoexport_fields("/path_to_file", db.muTest_tweets_users,  excl_fields_lst=['_id'])
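
For orientation, here is a small sketch of writing a list of field names to a file while honouring an exclusion list. It assumes one field name per line, which is the layout mongoexport accepts through its --fieldFile option; whether mongoexport_fields() writes exactly this layout is an assumption. field_file_text and write_field_file are hypothetical names.

```python
def field_file_text(fields, excl_fields_lst=()):
    """Return file contents: one field name per line, exclusions removed.

    Hypothetical helper for illustration; not the mongoUtils implementation.
    """
    excluded = set(excl_fields_lst)
    return "".join(f + "\n" for f in fields if f not in excluded)


def write_field_file(file_path, fields, excl_fields_lst=()):
    """Write the field list to file_path (layout assumed, see lead-in)."""
    with open(file_path, "w") as fh:
        fh.write(field_file_text(fields, excl_fields_lst))


if __name__ == "__main__":
    print(field_file_text(["_id", "_id.str", "name"], excl_fields_lst=["_id"]))
```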