Mongo Findings
COMPATIBILITY
Because we use different MongoDB versions, we need a way to know which PyMongo version has to be used (Motor uses PyMongo as a dependency).
The first link (see sources below) points to a table that shows which PyMongo versions are compatible with which MongoDB versions.
Even then, authentication problems can still occur, as with the visualytics MongoDB, so you need to specify the authentication explicitly for the database. To do this, add /?authSource=admin&authMechanism=SCRAM-SHA-1 to the connection string. For the visualytics coverage problem we used PyMongo 3.12.
Ex:
    from pymongo import MongoClient

    def get_data_mongo():
        client = MongoClient("mongodb://picalike3:6fK5DnNFbhin@mongodb01.live.picalike.corpex-kunden.de/?authSource=picalike3&authMechanism=SCRAM-SHA-1")
        db = client['picalike3']
        return db
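Since Motor just wraps PyMongo, the same connection-string options should work for async code as well. A minimal sketch (host, credentials, and database/collection names here are placeholders, not the real visualytics values):

    from motor.motor_asyncio import AsyncIOMotorClient

    async def check_auth():
        # Placeholder URI; the authSource/authMechanism parameters are the important part.
        client = AsyncIOMotorClient("mongodb://user:password@mongohost/?authSource=admin&authMechanism=SCRAM-SHA-1")
        db = client["some_db"]
        # A cheap round trip to verify that authentication works.
        return await db["some_collection"].count_documents({})

    # run with: asyncio.run(check_auth())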
Sources:
tags: mongo, compatibility, auth failed, Authentication failed, code 18
CURSORS
When iterating over large amounts of data, a cursor is typically used, e.g.:
    import pymongo

    client = pymongo.MongoClient(DB_URI)
    collection = client[db_name][collection_name]
    filters = {...}      # query filter
    projection = {...}   # fields to return
    for doc in collection.find(filters, projection):
        process_document(doc)
This can cause problems when process_document(doc) takes a long time. Iterating over the cursor means fetching a batch of documents from MongoDB, returning them, and repeating this once the batch is exhausted, until the cursor has no more documents to fetch. If too much time passes between fetches (10 minutes by default), the cursor is closed automatically. To prevent this, one can change the find call to collection.find(filters, projection, no_cursor_timeout=True).
HOWEVER, this is dangerous! If the script crashes before the cursor is exhausted, the cursor is not closed automatically and stays open until the MongoDB server is restarted (effectively forever on production systems). One problem remains: even with no_cursor_timeout, the session in which the cursor lives can still expire. To prevent this, we need to use an explicit session and refresh it every few minutes (by default, a session is removed after 30 minutes).
Clean solution that always closes the cursor and uses an explicit session:
    from datetime import datetime

    with client.start_session() as session:
        with collection.find(filters, projection, no_cursor_timeout=True, session=session) as cursor:
            last_refresh = datetime.now().timestamp()
            for doc in cursor:
                now = datetime.now().timestamp()
                if now - last_refresh > 5 * 60:
                    # keep the server-side session alive (it would expire after 30 minutes)
                    res = client.admin.command('refreshSessions', [session.session_id], session=session)
                    last_refresh = now
                process_document(doc)
compare: https://developpaper.com/four-solutions-to-the-problem-of-mongodb-cursor-timeout/
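If the real problem is only that processing a full batch takes close to the 10-minute limit, a lighter alternative (a sketch reusing the names from the snippets above, not something we have benchmarked) is to shrink the batch size so the driver returns to the server more often:

    # Sketch: smaller batches mean more frequent getMore calls, so each fetch
    # happens well within the 10-minute cursor timeout. This only helps when a
    # whole batch (not a single document) is what takes too long to process.
    for doc in collection.find(filters, projection).batch_size(100):
        process_document(doc)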
tags: CursorNotFound, cursor not found, cursor timeout, cursor closed
Sebastian's findings
Mongo Knowledge
- Why shouldn't I embed large arrays in my documents?: http://www.askasya.com/post/largeembeddedarrays/
- 24.01.22 findings psql vs MongoDB:
- offset + limit have a big influence on performance: the higher the offset, the slower the query. For example, the query of the loop averages 0.07s per query with an offset of 1,000 and a limit of 30, while mybestbrands_de_crawler averages 1.9s with an offset of 100,000 and a limit of 30. Walking over the table with a growing offset can make a query so slow that it gets killed (see the paging sketch after this list).
- that's why a limit of 100 was the maximum per query; a limit of 1,000 takes too long once the offset is over 300,000, so big shops are hard to import into MongoDB.
- $lookup as an aggregation stage works, but is very slow since it cannot use an index.
- this makes using 2 collections impossible for fast analytical queries (about_you_de_feed takes 40s); see the $lookup sketch after this list.
- using only one collection seems to be the solution: all information in each document. The index and collection size is higher than with 2 collections (10GB with 2 collections vs. 13GB with 1 collection for 118m documents).
- inserting history data into an embedded array gets very slow over time: 10,000 never-out-of-stock products need 20 seconds to complete a $push over all documents.
- with one document per history entry, bulk inserts take a constant 21-23s per 100,000 documents, from the first bulk insert up to 118m documents.
- https://docs.mongodb.com/manual/reference/operator/aggregation/out/#pipe._S_out ($out can be used as the last stage of an aggregation to save the result to a cache collection)
- https://docs.mongodb.com/manual/core/views/ might also be a good way of caching queries
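For reference, the OFFSET/LIMIT paging pattern described above looks roughly like this on the PostgreSQL side (a sketch only; the table name, the id column, and the DSN are made up for illustration):

    import psycopg2

    LIMIT = 100  # larger limits became too slow once the offset grew (see above)

    def walk_table(conn, shop):
        """Yield all rows for one shop, page by page, via OFFSET/LIMIT."""
        offset = 0
        with conn.cursor() as cur:
            while True:
                cur.execute(
                    "SELECT * FROM products WHERE shop = %s ORDER BY id LIMIT %s OFFSET %s",
                    (shop, LIMIT, offset),
                )
                rows = cur.fetchall()
                if not rows:
                    break
                yield from rows
                offset += LIMIT  # each page scans further into the table, so it gets slower

    # conn = psycopg2.connect("dbname=... user=...")  # placeholder DSN
    # for row in walk_table(conn, "mybestbrands_de_crawler"):
    #     ...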
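And a rough sketch of the two-collection approach that turned out to be too slow: a $lookup join between a product collection and a history collection, with $out writing the result into a cache collection as mentioned in the link above. Collection and field names (products, history, prod_id, analytics_cache) are assumptions, not the real schema:

    from pymongo import MongoClient

    db = MongoClient("mongodb://localhost:27017")["analytics_test"]  # placeholder connection

    pipeline = [
        {"$match": {"shop": "about_you_de_feed"}},
        # the $lookup stage is what made the two-collection setup slow (see above)
        {"$lookup": {
            "from": "history",
            "localField": "prod_id",
            "foreignField": "prod_id",
            "as": "history",
        }},
        # $out must be the last stage; it writes the result into a cache collection
        {"$out": "analytics_cache"},
    ]
    db["products"].aggregate(pipeline)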
Product findings
1. each product has only one document; the product history is saved as objects inside an array.
   - uses the least memory, but Björn's guess was right: each $push into the history array makes the next $push slower, until it becomes too slow. With 10,000 documents and a history array of 100 entries, this process takes about 10 seconds, which doesn't sound that bad. But since pushes get slower the more entries the array has, and we could end up with more than 1,000,000 products with > 100 entries in their history array, the time could explode. So this is not really the way to go (see the sketch after this list).
2. each product has multiple documents; each time a product is found, a new document is created.
   - uses the most memory but is the fastest: no $lookup needed, everything is in one place. However, categories, attributes, genders, name, prod_id, extra_internal_id, image, … are stored multiple times, since each document contains everything about the product. Another drawback is that we need to $group to get distinct items.
3. each product has an entry in the product collection and its history goes into a separate collection (product & history collections).
   - takes more memory than case 1 but a lot less than case 2: only about 1/3 as much memory as case 2, but it is 4x slower since we need $lookup to combine the documents.
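To make the cases above concrete, here is a sketch of case 1 ($push into an embedded history array) versus case 2 (one document per observation, inserted in bulk). Connection, collection, and field names are placeholders:

    from datetime import datetime, timezone
    from pymongo import MongoClient

    db = MongoClient("mongodb://localhost:27017")["analytics_test"]  # placeholder connection
    products = db["products"]              # case 1: one document per product
    observations = db["product_history"]   # case 2: one document per observation

    now = datetime.now(timezone.utc)

    # Case 1: history embedded as an array; every update is a $push,
    # which gets slower the more entries the array already has.
    products.update_one(
        {"prod_id": "abc-123", "shop": "mybestbrands_de_crawler"},
        {"$push": {"history": {"date": now, "price": 59.95, "in_stock": True}}},
    )

    # Case 2: one document per observation, written as a bulk insert;
    # insert time stays constant, but static product data is duplicated per document.
    observations.insert_many([
        {
            "prod_id": "abc-123",
            "shop": "mybestbrands_de_crawler",
            "date": now,
            "price": 59.95,
            "in_stock": True,
            "name": "...", "category": "...", "image": "...",  # duplicated in every document
        },
        # ... one document per product seen in this crawl
    ])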
- Scenario if we stick with case 2: at the moment we have about 1.5 years of data and use about 60GB. Scaling our crawler up to daily and keeping data for 3 years means a factor of 7 days x 2. Since I forgot to add attributes to the documents and haven't imported everything into MongoDB yet, I would guess we need about 80GB for the current 1.5 years of data. Multiplying that by 14 (7 for daily crawls, 2 to get to 3 years), we would end up needing about 1.1TB for the number of shops we have now. This makes case 2 the best choice for the near future.
- Reducing the memory footprint of case 2:
  - we could create one or more collections just for categories, attributes, and genders, give them IDs, and in the items collection store only the ID instead of the whole category/attribute data.
  - we could save only the deepest categories of each category path per product as an array, and then search only against the last position in the category tree.
  - get product name, deeplink, image, and all fields that are not used for filtering from a product-metadata collection. Using $lookup as the last stage of the search aggregation (search is the only part that needs product data like name, deeplink, image) should be fast enough; see the sketch below.
- Doing these optimizations should reduce the required memory by roughly 20-30%.
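A sketch of the last point, reusing the placeholder names from the case sketch above and assuming a hypothetical product_metadata collection keyed by prod_id: filter, sort, and limit first, then $lookup the metadata only for the few documents that are actually returned:

    search_pipeline = [
        {"$match": {"shop": "about_you_de_feed", "category": "shoes"}},  # example filters
        {"$sort": {"date": -1}},
        {"$limit": 30},
        # metadata (name, deeplink, image, ...) is joined only for the 30 results
        {"$lookup": {
            "from": "product_metadata",
            "localField": "prod_id",
            "foreignField": "prod_id",
            "as": "meta",
        }},
        {"$unwind": "$meta"},
    ]
    results = list(observations.aggregate(search_pipeline))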