====== Mongo Findings ======
==== COMPATIBILITY ====
Because we run different MongoDB versions, we need to know which PyMongo version to use (Motor uses PyMongo as a dependency). The first link below points to a table that maps PyMongo versions to compatible MongoDB server versions.

Even then, authentication problems can still occur (as with the visualytics MongoDB), so the authentication database has to be specified explicitly. To do this, append ''%%/?authSource=admin&authMechanism=SCRAM-SHA-1%%'' to the connection string. For the visualytics coverage problem we used PyMongo 3.12.
Example:
<code python>
from pymongo import MongoClient

def get_data_mongo():
    # authSource and authMechanism are specified directly in the connection string (see above)
    client = MongoClient("mongodb://picalike3:6fK5DnNFbhin@mongodb01.live.picalike.corpex-kunden.de/?authSource=picalike3&authMechanism=SCRAM-SHA-1")
    db = client['picalike3']
    return db
</code>
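To see which row of the compatibility table applies to a given setup, a minimal sketch like the following can print the installed driver version and the server version; the connection string is a placeholder, not a real deployment:
<code python>
import pymongo
from pymongo import MongoClient

# Placeholder URI - replace with the actual connection string.
client = MongoClient("mongodb://user:password@host/?authSource=admin&authMechanism=SCRAM-SHA-1")

print("PyMongo driver version:", pymongo.version)                   # version of the installed driver
print("MongoDB server version:", client.server_info()["version"])   # version reported by the server
</code>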
Sources:
* https://www.mongodb.com/docs/drivers/pymongo/
* https://www.mongodb.com/community/forums/t/mongosecurityexception-exception-authenticating-mongocredential/106790/7
tags: mongo, compatibility, auth failed, Authentication failed, code 18
==== CURSORS ====
When iterating over large amounts of data, a cursor is typically used, e.g.:
<code python>
import pymongo

client = pymongo.MongoClient(DB_URI)
collection = client[db_name][collection_name]
filters = {...}      # placeholder query filter
projection = {...}   # placeholder projection

for doc in collection.find(filters, projection):
    process_document(doc)
</code>
This can cause problems when ''%%process_document(doc)%%'' takes a long time. Iterating over the cursor means that MongoDB fetches a batch of documents, returns them, and repeats this once the batch is exhausted, until the cursor has no more documents to fetch. If too much time passes between two fetches (10 minutes by default), the cursor is closed automatically. To prevent this, the find call can be changed to ''%%collection.find(filters, projection, no_cursor_timeout=True)%%''. HOWEVER, this is dangerous: if the script crashes before the cursor is exhausted, the cursor is not closed automatically and stays open until the MongoDB server is restarted (effectively forever on production systems). One problem remains: even with ''%%no_cursor_timeout%%'', the session in which the cursor lives can expire. To prevent this, we need to use an explicit session and refresh it every few minutes (by default a session is removed after 30 minutes).
Clean solution that always closes the cursor and uses an explicit session:
<code python>
from datetime import datetime

with client.start_session() as session:
    with collection.find(filters, projection, no_cursor_timeout=True, session=session) as cursor:
        last_refresh = datetime.now().timestamp()
        for doc in cursor:
            now = datetime.now().timestamp()
            # Refresh the session every 5 minutes so it does not expire (default lifetime: 30 minutes).
            if now - last_refresh > 5 * 60:
                client.admin.command('refreshSessions', [session.session_id], session=session)
                last_refresh = now
            process_document(doc)
</code>
compare: https://developpaper.com/four-solutions-to-the-problem-of-mongodb-cursor-timeout/
tags: CursorNotFound, cursor not found, cursor timeout, cursor closed
==== Sebastian findings ====
=== Mongo Knowledge ===
* Why shouldn't I embed large arrays in my documents?: http://www.askasya.com/post/largeembeddedarrays/
  * 24.01.22 findings, psql vs MongoDB:
  * offset + limit have a big influence on performance: the higher the offset, the slower the query. For example, the loop query averages 0.07s per query with an offset of 1000 and a limit of 30, while mybestbrands_de_crawler averages 1.9s with an offset of 100000 and a limit of 30. Walking over the table via offset can make a query so slow that it gets killed.
  * That is why a limit of 100 was the maximum per query; a limit of 1000 takes too long once the offset exceeds 300,000, so big shops are hard to import into the mongo.
  * $lookup as an aggregation stage works, but is very slow since it cannot use an index (see the $lookup sketch after this list).
  * This makes using 2 collections impossible for fast analytical queries (about_you_de_feed takes 40s).
  * Using only one collection seems to be the solution: all information in one document. The size of index and collection is higher than with 2 collections (10GB with 2 collections vs 13GB with 1 collection for 118m documents).
  * Pushing history data into an embedded array gets very slow over time: 10,000 never-out-of-stock products would need 20 seconds to complete a $push over all documents.
  * Writing one document per history entry takes a constant 21s-23s per 100,000 documents as a bulk insert, from the first bulk insert up to 118m documents (see the history-modeling sketch after this list).
  * https://docs.mongodb.com/manual/reference/operator/aggregation/out/#pipe._S_out (could be used directly at the end of an aggregation to save the result into a cache collection, see the $out sketch after this list)
  * https://docs.mongodb.com/manual/core/views/ might also be a good way of caching queries
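$lookup sketch: a minimal illustration of the two-collection vs. one-collection finding above. The collection names (''%%products%%'', ''%%history%%'', ''%%products_with_history%%'') and the join field ''%%product_id%%'' are assumptions, not the actual schema:
<code python>
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder URI
db = client["shop_db"]                             # hypothetical database name

# Two-collection variant: join history onto products via $lookup.
# This was measured as very slow in the findings above.
joined = db.products.aggregate([
    {"$match": {"shop": "about_you_de_feed"}},
    {"$lookup": {
        "from": "history",
        "localField": "product_id",
        "foreignField": "product_id",
        "as": "history",
    }},
])

# Single-collection variant: everything embedded in one document,
# so a plain indexed find() is sufficient.
docs = db.products_with_history.find({"shop": "about_you_de_feed"})
</code>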
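History-modeling sketch: the two approaches compared above, an embedded array updated via $push vs. one document per history entry written as a bulk insert. Field names and values are illustrative only:
<code python>
from pymongo import MongoClient, InsertOne

client = MongoClient("mongodb://localhost:27017")  # placeholder URI
db = client["shop_db"]                             # hypothetical database name

# Variant 1: append each new history point to an embedded array.
# This gets slower over time as the arrays grow.
db.products.update_one(
    {"product_id": "abc123"},
    {"$push": {"history": {"date": "2022-01-24", "price": 19.99, "in_stock": True}}},
)

# Variant 2: one document per history entry, written as a bulk insert.
# This stayed constant at roughly 21-23s per 100,000 documents.
ops = [
    InsertOne({"product_id": "abc123", "date": "2022-01-24", "price": 19.99, "in_stock": True}),
    # ... more history documents ...
]
db.product_history.bulk_write(ops, ordered=False)
</code>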
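$out sketch: how ''%%$out%%'' could be appended as the last stage of an aggregation to materialize its result into a cache collection, as described in the linked docs. The pipeline contents and collection names are made up:
<code python>
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder URI
db = client["shop_db"]                             # hypothetical database name

# $out as the final stage writes the aggregation result into the named
# (cache) collection instead of returning it to the client.
db.products_with_history.aggregate([
    {"$match": {"shop": "about_you_de_feed"}},
    {"$group": {"_id": "$category", "num_products": {"$sum": 1}}},
    {"$out": "category_counts_cache"},
])
</code>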
=== Product findings ===