Incidents
- Date: 2020-11-10
- Host: index01-hpc
- Component: v3-billing
- Script: bin/distributer.py
- Problem: messages could not be sent to the collection endpoints on frontend05-hpc. This created a growing backlog and ultimately ended in out-of-memory (OOM). This was documented in /var/log/syslog (or older rotated files).
- Solution: added a semaphore with size 50
- Date: 2020-11-10
- Host: frontend05-hpc
- docker containers: top_viewed_api, also_viewed_api, get_cat_trends
- Problem: containers could not connect to the local mongo (top_viewed_mongo). This was the first time this problem was encountered.
- Solution: placed all containers in a docker network (top_viewed_network); see the sketch below
- Cause: unknown, may be related to docker storage engine changes (Corpex-Ticket: 6609)
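- Example (sketch): a minimal shell sequence matching the solution above, assuming the containers already exist and only need to be attached to a shared network.
  docker network create top_viewed_network
  # attach the API containers and the mongo so they can reach each other by container name
  for c in top_viewed_api also_viewed_api get_cat_trends top_viewed_mongo; do
    docker network connect top_viewed_network "$c"
  done
  # inside the network the mongo is then reachable as top_viewed_mongo (default port 27017)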
- Date: 2020-11-27
- Host: all frontends
- Problem: strange sim API results; in particular, different frontends returned different results
- Solution: call lilly sync from sg02 (check history for sync call)
- Cause: unknown, maybe related to the VPN problems over the last few days
- Date: 2020-12-01
- Host: all frontends
- Problem: The look API did not yield results for some products of a look (seen first for Otto AT, but later on for many shops).
- Cause: Incorrect handling of category names. Usually a category should be saved as-is into the corresponding mongo collections (ic_<uid>, ic_7510, pci_styles), but for some reason special characters (umlauts, <, >, etc.) were substituted by whitespace for some products (in some shops). While handling these categories in the Look API and inside Lilly, they were not looked up by the category ID (which is also saved in the mongo) for each product, but re-encoded inside Lilly (crc32 function), which then searched for a wrong, non-existing ID and yielded "category not existing". It is still an open question why/how the category is saved in this wrong way at all.
- Solution: We manually replaced the wrong categories with the correct categories in ic_7510 and pci_styles and then changed the lilly_data_from_mongo script on sg02 (the former version is named lilly_data_from_mongo_20201202; check the changes with the diff command) so that it does not use the wrong categories when exporting ic_905 (a special solution for one of the affected customers) but the correct data given for each product.
- Open/ToDo: the special solution has to be replaced by importing the categories correctly. This must also include the fix for all other shops with that error.
- Additional Solution: for the corresponding shops, we added the entry "strip_chars": false to the mongo user collection, which should fix this issue on import (for both the ic_* collections and the pci_styles collection, not for ic_7510, which is not used any more)
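- Example (sketch): a hypothetical mongo shell update for one affected shop; the database name and the exact field layout of the user collection are assumptions, verify them before running.
  mongo <DB_NAME> --eval 'db.user.updateOne({ uid: "<uid>" }, { $set: { "strip_chars": false } })'   # keep special characters on import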
- Date: 2020-12-19
- Host: report-engine
- Problem: slow answers from report-engine
- Cause: running view refresh + lack of disk space
- Solution: add more disk space
- Date: 2021-01-04
- Host: index04
- Problem: V3 Feed-Imports did not finish since 2021-01-03
- Cause: redis_export on index04 was not running.
- Log: index04:/home/picalike/var/log/kpo_collector.log
- 2021-01-04 09:06:25,709 ERROR [kpo] {index04.picalike.corpex-kunden.de} No label exporter of type _v4-labels-jrpc._tcp.local. with service level live found!
- 2021-01-04 09:06:30,712 ERROR [kpo] {index04.picalike.corpex-kunden.de} failed to send label notifications, will retry in 300.0000 seconds
- Solution: start redis_export (as described in TODO.restart)
- Date: 2021-01-15, 2021-02-14
- Host: sg02
- Problem: redis-killer is down
- Exception:
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 763, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/mnt/storage/var/live/lilly_sync_data/lib/picapika/rabbitclient.py", line 232, in background_thread
    self._connection.processEvents()
  File "/mnt/storage/var/live/lilly_sync_data/lib/picapika/rabbitclient.py", line 111, in processEvents
    self._connection.process_data_events(0)
  File "/home/picalike/.local/lib/python2.7/site-packages/pika/adapters/blocking_connection.py", line 650, in process_data_events
    self._flush_output(timer.is_ready, common_terminator)
  File "/home/picalike/.local/lib/python2.7/site-packages/pika/adapters/blocking_connection.py", line 411, in _flush_output
    self._impl.ioloop.process_timeouts()
  File "/home/picalike/.local/lib/python2.7/site-packages/pika/adapters/select_connection.py", line 283, in process_timeouts
    timer['callback']()
  File "/home/picalike/.local/lib/python2.7/site-packages/pika/adapters/blocking_connection.py", line 93, in signal_once
    assert not self._ready, '_CallbackResult was already set'
AssertionError: _CallbackResult was already set
- Mitigation: restarted service
- Further Information: the rabbitmq-server has been running since 2019 and the redis-killer script has not been changed since 2019
- Date: 2021-01-18, 2021-04-14
- Host: frontend-hpc02, v22019026221283998.hotsrv.de
- Problem: heartbeat_monitor.py on sg01 says: LillySyncService:lilly.frontend02-hpc.picalike.corpex-kunden.de is down
- Mitigation: check the lilly directory on the machine (netcup: ls --sort=time -l /home/picalike/lilly-data/ | head -n10, corpex: ls --sort=time -l /mnt/storage/var/lilly-data/ | head -n10) for recently changed directories. To check the logs, do NOT use the docker logs, but ~/log/sync_data.log. If not fresh or in doubt:
docker restart frontend_instance1
Additionally, make sure that the frontend is responsive (Same-O-Same-O Mail).
- Exception
[2021-01-18 16:24:49 +0000] [19] [CRITICAL] WORKER TIMEOUT (pid:335)
2021-01-18 16:31:27,127 CRITICAL -- Connection close detected
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/usr/local/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/mnt/storage/var/live/lilly_sync_data/lib/picapika/rabbitclient.py", line 232, in background_thread
    self._connection.processEvents()
  File "/mnt/storage/var/live/lilly_sync_data/lib/picapika/rabbitclient.py", line 111, in processEvents
    self._connection.process_data_events(0)
  File "/usr/local/lib/python2.7/site-packages/pika/adapters/blocking_connection.py", line 650, in process_data_events
    self._flush_output(timer.is_ready, common_terminator)
  File "/usr/local/lib/python2.7/site-packages/pika/adapters/blocking_connection.py", line 426, in _flush_output
    raise exceptions.ConnectionClosed()
ConnectionClosed
- Date: 2021-01-21
- Host: frontend06-hpc
- Component: new-api
- Problem: On frontend06-hpc an old Flask application named 'new_api' is running. From time to time the partition /mnt/storage reaches 100 % usage, because new_api logs a lot and has no log rotation.
- Solution:
- truncate -s 0 new_api.log works without restarting the service
- See for details about new_api: http://dokuwiki.picalike.corpex-kunden.de/f06-hpc_new_api
- Date: 2021-03-04
- Host: pci01
- Component: witt_reports
- Problem: witt_reports_container consumes 40% of memory after a while
- Solution: docker restart witt_reports_container
- Date: 2021-03-09
- Host: netcup servers
- Component: image_cloud
- Problem: the VPN connection was lost and no content was delivered
- Solution: restart openvpn
- Date: 2021-03-14
- Host: pci01
- Component: all on pci01
- Problem: VM was unresponsive and had to be restarted (maybe problem in feed_import or memory problem in witt_reports?)
- Solution: had to restart witt_reports and visualytics_notification_api; the billing service also went down as collateral (it needs pci on port 8090 to function). TODO_restart on pci01 was updated.
- Date: 2021-03-18
- Host: frontend05-hpc
- Component: newly started docker services
- Problem: The uptime of the machine was > 700 days. We deployed a new container there and started the service, but we could not connect to the public TCP port. The service was correctly initialized because the port was reachable internally. Thus, we suspected a problem with docker's outside→inside port mapping.
- Solution: The whole server was restarted; restarting dockerd alone might have sufficed, but we do not know this for sure. After the restart, the service worked as expected.
- Date: 2021-03-29
- Host: frontend06-hpc
- Component: new_api
- Problem: logging filled /mnt/storage to the max
- Further Information: Widgets on madeleine.de stopped working!
- Solution: stop new_api (echo "q" > /run/shm/new_api.ctrl), delete the log (rm /mnt/storage/var/log/new_api.log), start new_api again (uwsgi --ini /mnt/storage/var/etc/new_api.ini &)
- Date: 2021-04-23, 2021-05-05
- Host: dev02
- Component: v5_extractor
- Problem: the refresh procedure sometimes seems to be in a deadlock, which means scheduled updates are not finished. This can also be seen by sorting the feature stores by time (ls -l --sort=time) and finding no updates within the last 24h. A symptom is that push_attrs.py did not transfer any attributes for a day.
- Further Information: It can be seen that the docker logs are mostly warnings and no actual processing steps. Without new features for products, no OSA queries or int scores are possible.
- Solution: Stop the service. Check if there are any journal files in /home/picalike/v5/v5_backend/feeds. If so, execute sqlite3 <shop_id>.db vacuum, which applies the journal, and then restart the service. Finally, refresh feed + crawler manually with the curl requests from the cronjob (dev02); a sketch follows below.
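- Example (sketch): the recovery steps above as shell commands; the container name v5_extractor and the journal file suffix are assumptions.
  docker stop v5_extractor
  cd /home/picalike/v5/v5_backend/feeds
  for j in *.db-journal; do                 # any leftover journal files?
    [ -e "$j" ] || continue
    sqlite3 "${j%-journal}" 'VACUUM;'       # applies the journal and compacts the database
  done
  docker start v5_extractor
  # finally, re-trigger feed + crawler refresh with the curl requests from the dev02 cronjob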
- Date: 2021-04-29
- Host: dev02
- Component: v5_extractor / push_attrs
- Problem: push_attrs was executed while v5_extractor worked on features. The database was locked and other components did not handle it correctly.
- Solution: A bugfix was deployed; as a fallback, a docker restart helps
- Date: 2021-10-03
- Host: tegraboards
- Component: v4 feature extraction
- Problem: some tegraboards are not responding. (Sometimes tegraboards do not respond in time to the monitoring script; in that case no further steps are required.)
- Solution:
- check if gpu_extractor is still working (are there new entries in the log under ~/var/log?)
- if the gpu_extractor is stalled, kill the current instance (ps aux | grep gpu_extractor) and get the command line to restart the service from "history | grep gpu_extractor"
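- Example (sketch): checking and restarting a stalled gpu_extractor on one board; the exact start command comes from the shell history and is not reproduced here.
  ls -l --sort=time ~/var/log | head        # are there recent gpu_extractor log entries?
  ps aux | grep [g]pu_extractor             # note the PID and the full command line
  kill <pid>                                # placeholder: the PID found above
  history | grep gpu_extractor              # re-run the original start command from here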
- Date: 2021-05-10
- Host: tegraboards
- Component: v4 feature extraction
- Problem: some tegraboards are not working
- Solution: (how to swap which tegra boards are used for the feature calculation)
- index03: deactivate the v3 feed import in the crontab. HINT: do not forget to re-enable it at the end
- index04:
- adjust /mnt/storage/var/etc/v3/kpo_iterware.conf
- check the kpo log (less /mnt/storage/var/log/kpo_collector.log)
- kill all kpo_collector processes (ps aux | grep kpo_collector, kill …)
- check TODO.restart to start the kpo_collector again
- check that the start worked; it can happen that the service cannot connect to the redis labels export and therefore terminates (in that case start it again)
- index03: re-enable the feed import
- Date: 2021-05-20
- Host: frontends/sg02
- Component: look api
- Problem: there was no result for a Look API query given a product that is included in a look. In the logs one saw the message "No styles found: no look fulfills the constraints" and, checking further, that the product in style collection ic_7510 was not similar enough to itself (in the product collection).
- Solution: we checked the features in the v4 image cloud and found that they are not equal for the product and for the same product as part of the style (product collection vs. 7510 collection). We then found strange behavior in Lilly: different products had been mapped to the same features, indicating that they have the same PHash. We confirmed this by checking the phash mongo field of both products. To resolve the issue we then deactivated deduplication via the shop settings in the mongo, inserting "deduplication": false into the settings field (compare uid 3530 in the user collection).
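- Example (sketch): a hypothetical mongo shell command for the deduplication switch; the database name and the exact settings layout are assumptions taken from the description above.
  mongo <DB_NAME> --eval 'db.user.updateOne({ uid: 3530 }, { $set: { "settings.deduplication": false } })'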
- Date: 2021-05-31, 2021-09-15
- Host: frontends
- Component: frontend_instance1 container
- Problem: heartbeat_monitor.py on sg01 says: LillySyncService:lilly.v22019026221284001.happysrv.de is down. In the logs: 'CRITICAL -- Connection close detected'
- Solution: go to the affected frontend, execute curl "http://localhost:9000/disable", wait until the queries to Lilly stop (check the log in ~/log/lilly.log), then execute docker restart frontend_instance1
- Additional Hints: check http://sg01.picalike.corpex-kunden.de:5002/by_service
- Date: 2021-06-28, 2022-12-22
- Host: redis01
- Component: redis (symptom is that results differ between look.php?… vs look.php?…&redis=off)
- Problem: redis cache contains old data
- Solution: ssh to redis01, then use redis-cli -n 2 --scan --pattern '*<apikey>*' to show all matching entries. You can pipe this into xargs redis-cli -n 2 del to actually delete them:
  redis-cli -n 2 --scan --pattern '*<apikey>*' | xargs redis-cli -n 2 del
- Date: 2021-07-15
- Host: psql02
- Component: psql
- Problem: no space left → docker logs
- Solution: modified run start and restarted
- Date: 2021-07-22
- Host: netcup psql01
- Component: v5_cat_top_trends, v5_sim [all components that are using psql01]
- Problem: possible I/O problems; the host was temporarily not reachable, which triggered the monitoring and led to stalled v5 imports
- Solution: none, find better hosting provider [corpex ticket #10430]
- Date: 2021-08-10
- Host: psql02
- Component: Host/Net
- Problem: the host was not reachable via ping at 03:30 in the morning. But this might just be a symptom of a bigger issue
- Solution: none; it only happened once and never before, which is why this incident was created
- Date: 2021-08-12
- Host: psql02
- Component: refresh_trend_data
- Problem: the database was in an inconsistent state ('primary key violations')
- Solution: none, it only happened once but the cause is not known
- Date: 2021-08
- Host: most netcup servers
- Component: v5_image_picker
- Problem: the DB connection to cloud01:psql was gone
- Solution: restart, but unclear why it happened
- Date: 2021-08-20
- Host: shop conveyor belt (?)
- Component: feed import
- Problem: All feed shops were stalled and did not finish, which is why there was old data in the v5 psql backends. This might or might not be related to the migration of the report engine.
- Solution: re-import triggered
- Date: 2021-08-24
- Host: pg02 (v220201062212128885.bestsrv.de)
- Component: live postgres host
- Problem: From 01:00-06:00 we got a lot of alerts due to high ping times to the host. At this time the daily live import was running and may have been partly responsible for the high ping times. But since a full import is done every day (on Fridays also with all jobs), that does not explain why this pattern occurred just now.
- Solution: None for now
- Date: 2021-08-23 19:00
- Host: cloud01
- Component: various, like image-cloud
- Problem: a lot of timeouts '[2021-08-23 19:02:20,248 ERROR/v5_extractor]: 140645008144128: HTTPConnectionPool(host='cloud01.picalike.corpex-kunden.de', port=5000): Read timed out. (read timeout=5)' probably due to a very high load. Cross reference: a lot of stalled shops were refreshed and might be terminated at this time.
- Solution: Fixed itself
- Date: 2021-09-14 12:42
- Host: multiple netcup server (frontend)
- Problem: fatal uwsgi errors related to the health check
Command '['uwsgi', '--socket', '/tmp/sim_api.sock', '--nagios']' returned non-zero exit status 2
- Solution: None. restarted
- Date: 2021-09-28
- Host: frontend05-hpc
- Problem: http://frontend05-hpc.picalike.corpex-kunden.de:8012/get_top_look?shop_id=cGljc2ltaWxhcjozNTMw&mode=top&num_entries=5&last_n_days=30 did not return results (empty list)
- Solution: the update script /home/picalike/docker_bin/top_looks_fill_db/update_collections.py was not running any more, as one could see by checking /mnt/storage/var/log/top_looks_update.log. Started it with python3 update_collections.py &> /mnt/storage/var/log/top_looks_update.log & (compare the docker restart script), waited until it finished the updates (took some minutes) and deleted the cache (mongo collection count_cache, !!CAREFUL - DO NOT DELETE ANOTHER COLLECTION!!; this step is optional, otherwise wait one hour)
- Date: 2021-09-28 12:42
- Host: dev02
- Problem: very few products for a feed shop where we expect many products
- Solution: a preprocessor didn't run correctly due to RAM issues on sg01. Increased the machine's RAM size. Found the problem by looking at the postgres tables and the meta_db_collection using different timestamps.
- Date: 2021-10-04
- Host: frontend05-hpc
- Problem: status code 500 at http://frontend05-hpc.picalike.corpex-kunden.de:5003/health
- Solution: the trends view had no entries for madeleine_de_feed which caused the health check to fail. manually triggered the /home/picalike/v5/scripts/refresh_trend_data.sh script on dev02 (make sure nothing with equal lock is running at the same time) to calc the missing int scores and refresh the view.
- Date: 2021-10-25
- Host: pci01
- Problem: SCB: feed_import failed for universal_at_feed: feed reader failed - details: exception while process, stats: None (from slack alerts),
- Exception:
Traceback (most recent call last):
File "/app/tasks/feed_reader.py", line 42, in read_feed failed, stats = process_feed(shop_id, feed_object, session, n_items=-1) File "/app/feed_processor.py", line 129, in process_feed for item_nr, data in enumerate(feed_object): File "/app/feed_objects/FeedObject.py", line 53, in __next__ row = self.reader.__next__() _csv.Error: field larger than field limit (131072) (from docker logs after retrying the shop) * **Solution**: inform customer about broken feed
- Date: 2021-11-16
- Host: sg01
- Component: pci_style_updater
- Problem: in alerts channel: heartbeat_monitor.py on sg01 says: style updater:sg01.picalike.corpex-kunden.de is down
- Solution: check if the style_updater is still running by checking the log: "tail style_updater.log" in the home directory on sg01 (are there current entries? → if yes, everything is fine). If the style_updater is really down, the v3 feed imports should start to pile up, because they send every product to the style updater
- Date: 2021-11-24, 2021-12-14
- Host: v22019026221284000 (netcup)
- Problem: reboot after migration / container restart → corpex monitoring failed (connection check port 9000)
- Solution: check if openvpn needs to be restarted on the server and do so (ssh as root, then (1) curl "http://dev02.picalike.corpex-kunden.de:8006/health"; if not ok, (2) kill the vpn process and use openvpn --log /root/vpn.log --daemon --config config.ovpn); restart the ssh port forward on sandy (compare /home/picalike/missing_deit_features_worker/start_missing_worker.sh); docker restart the image picker and frontend on the netcup server. See the sketch below.
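- Example (sketch): the VPN recovery on the netcup server as shell commands (run as root; the config path is taken from the text above and may differ per host).
  curl "http://dev02.picalike.corpex-kunden.de:8006/health"     # (1) only continue if this is not ok
  pkill openvpn                                                 # (2) kill the old VPN process
  openvpn --log /root/vpn.log --daemon --config config.ovpn     #     and start it again
  # then restart the ssh port forward on sandy (compare start_missing_worker.sh)
  # and docker restart the image picker and frontend containers on the netcup server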
- Date: 2021-12-01
- Host: sandy
- Problem: via alerting: ERROR http://sandy.picalike.corpex-kunden.de:20005 down: ('Connection aborted.'
- Solution: the local port forwarding via ssh is likely broken. Check ps aux | grep 10005 [no typo!] and, if it is missing, grep for the port in '/home/picalike/missing_deit_features_worker/start_missing_worker.sh' and manually execute the single SSH command.
- Date: 2021-12-08
- Host: dev02 (core problem at image-cloud)
- Problem: the v5_feed_extractor logs showed that image downloads failed for hm_de_crawler and asos_de_crawler with error code 403 (forbidden). Manually checking the URLs worked on the local machine and also on a netcup machine, but did not work on any corpex machine (even changing the user-agent had no effect) → probably an IP range block by 'akamai', which is the image host for both hm_de_crawler and asos_de_crawler (and also for other shops)
- Solution: image-cloud needs to use a tinyproxy (but only for shops hosted by akamai)
- Date: 2021-12-22
- Host: pci01
- Problem: alert at 7am that health endpoint of witt_reports container could not be reached:
ERROR http://pci01.picalike.corpex-kunden.de:8320 down: HTTPConnectionPool(host='pci01.picalike.corpex-kunden.de', port=8320): Max retries exceeded with url: /health (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f536f306e48>: Failed to establish a new connection: [Errno 111] Connection refused',))
- Solution: if there is no follow up alert, no action is required since the restart is done on purpose (see wiki)
- Date: 2022-01-01
- Host: a lot
- Problem: Juggling with dates. Instead of using deltas, the code mixed missing leading zeros ('W' instead of 'WW') with manual combinations of the current year and the last week, like '202252'. As a result, views contained no data or wrong data.
- Solution: use proper PSQL date functions instead of hard-coded / manual combinations; see the sketch below
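- Example (sketch): deriving the previous ISO week in PostgreSQL with date arithmetic instead of manual string building; table and column names are intentionally left out.
  psql -c "SELECT to_char(current_date - interval '7 days', 'IYYY') || to_char(current_date - interval '7 days', 'IW') AS last_iso_week;"
  # 'IW' is always two digits and 'IYYY' is the matching ISO year, so values like '202252' stay consistent at year boundaries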
- Date: 2022-01-13
- Host: dev01 / psql
- Component: osa sim
- Problem: The product 5176524#withfeed_de_feed has no int score, but osa sim calls return candidates. The problem is that the int scores use the new cluster scheme, while the sim calls do not. And with the clustering, some top-k results won't be returned.
- Solution: none for the moment. Without the new clustering, queries are slow, but with it, some nearest neighbors won't be found.
- Date: 2022-01-13
- Host: Corpex-Frontends
- Component: V3
- Problem: We got a lot of requests (probably due to a newsletter from Witt), many of them from 66.249.81.99 with User-Agent "Mozilla/5.0 (Windows NT 5.1; rv:11.0) Gecko Firefox/11.0 (via ggpht.com GoogleImageProxy)"
- Hint: tail -f /var/log/apache2/access-frontend01.picalike.corpex-kunden.de.log; the vimail-stuff is synced from frontend04-hpc to the other frontend[01-03]-hpc
- Solution: None
- Date: 2022-01-14
- Host: dev02
- Component: Mongo / Feed Import / Top Trends / psql 02 live
- Problem: for at least two runs, no shop data could be read from the MongoDB. It is unclear what happened and where.
- Solution: None
- Date: 2022-01-19
- Host: pci01
- Component: PCI01 Server overall, noticed through Feed Import
- Problem: the storage was full (the Feed Import alert said "feed_import_import failed for 'universal_at_feed' session '…': <class 'asyncpg.exceptions.DiskFullError'>") → the problem was the /mnt/storage/ directory, which hosts both docker and the upci service. While inserting data into the feed import postgres, the storage went full.
- Hints: upci operations: upci
- Solution: docker system prune to immediately get some space, then reached out to corpex for more disk space. But the main problem was that the upci logging files grew too large, probably a problem with the log rotation. Still open: limit the log size in the service and truncate the logs. A sketch follows below.
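- Example (sketch): a quick triage for a full /mnt/storage; the upci log path is hypothetical.
  df -h /mnt/storage
  docker system prune -f                                # reclaim space from unused images/containers
  du -sh /mnt/storage/* 2>/dev/null | sort -h | tail    # find the biggest directories (e.g. the upci logs)
  truncate -s 0 /mnt/storage/<upci_logs>/*.log          # hypothetical path: shrink oversized logs in place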
- Date: 2022-01-19
- Host: pci01
- Component: Feed Import (export), Mapping Service
- Problem: too many calls to the mapping endpoints to insert genders (also possible with brands or categories) led to the mapping service blocking (in this case 300k 'different' genders for one shop). If other shops then get imported, they cannot finish: they wait and at some point raise asyncio.exceptions.TimeoutError. The blocking shop at some point raises aiohttp.client_exceptions.ServerDisconnectedError.
- Solution: We had to restart the mapping service to get it working again (~/docker_bin/mapping-service/ice_tea_run.sh). Still open: fixing the problem in the import / mapping service
- Date: 2022-02-03
- Host: sandy:8042
- Component: Image Picker
- Problem: in some cases the health check will block → timeout, log alert
- Solution: none right now, but the service keeps working despite the blocking
- Date: 2022-02-01
- Host: i03, tegra boards
- Component: v3 feed import, v4 feature extraction
- Problem: the v4 feature extractor got stuck. On index03, running
python /mnt/storage/var/live/indexer/scripts/update/feedUpdateList.py /mnt/storage/var/etc/v3/feedUpdate.json
will return a list where the V4 field no longer decreases, e.g.:
c30d54b60ac625b74e2bb0c9ae2662b2 - 376, 0 of 5 packages pending (V4: 3381 urls to go), 01-Feb 07:22, export started: None
the number 3381 will never go down and more feeds are queued, waiting for the previous feed to finish
- Solution: one or all of the required services got stuck and need to be restarted. Possible culprits are:
- gpu_extractor on tegra boards
- maybe fs is read-only → reboot as user ubuntu, restart service as user picalike (use command history)
- disk full → kill process, rm log, restart service (takes a while)
- only start the shape extractor or the color extractor, except on t1of1; use /mnt/storage/var/etc/v3/kpo_iterware.conf to check which service needs to run
- services on worker01:
- check age of log files, if too old maybe restart the service (TODO.restart)
- kpo_collector on index04
- make sure the redis_export is running
- restart the kpo_collector (make sure the kpo_collector finds the redis_exporter in the log file… zeroconf)
- restart all feed imports that have waiting features (cancel, retrigger update)
- Date: 2021-02-24
- Host: sandy / netcup
- Component: v5 features
- Script: image_picker/missing_worker.py
- Problem: netcup services responded with 500 since the postgresql connection could not be recovered
- Solution: restarted the docker containers
- Date: 2022-03-11
- Host: pci01
- Component: v5 feed import (export) & Shop Conveyor Belt
- Problem: (many) shops are hanging in the feed_import_export stage for a long time (which can be seen in the SCB)
- Solution: restarted the docker containers via /home/picalike/docker_bin/feed-import/run.sh. Then, for each shop: (1) in the SCB, set the busy state to false, (2) go to http://pci01.picalike.corpex-kunden.de:1337/docs and use the /delete_shop_session endpoint with the session from the SCB and delete_stats set to true, (3) restart the shop from the SCB (all stages starting with feed_import_import). See the sketch below.
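- Example (sketch): a hypothetical call of the /delete_shop_session endpoint; whether it expects the values as query parameters (and which HTTP method it uses) should be checked on the /docs page first.
  curl -X POST "http://pci01.picalike.corpex-kunden.de:1337/delete_shop_session?session=<session_id>&delete_stats=true"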
- Date: 2022-03-22
- Host: dev02
- Component: v5_extractor
- Problem: batches are grouped by image URL and at least one image URL is present ~2,000 times, which leads to a batch size that exceeds the available memory.
- Solution: For now, those blocks are not processed which is no real solution but does not abort the extraction
- Date: 2022-03-22
- Host: psql02 (netcup)
- Component: postgresql
- Problem: the simshot matview grew larger and larger (20 GB → 272 GB)
- Solution: vacuum did not help, so the matview was re-created via drop + create (while taking care that no transactions were using this table); see the sketch below
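- Example (sketch): recreating the bloated matview instead of vacuuming it; the name simshot is taken from the incident text and the original view definition has to be filled in.
  psql -c "DROP MATERIALIZED VIEW simshot;"
  psql -c "CREATE MATERIALIZED VIEW simshot AS SELECT ...;"                   # re-use the original definition here
  psql -c "SELECT pg_size_pretty(pg_total_relation_size('simshot'));"         # confirm the size went back down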
- Date: 2022-05-19 (?)
- Host: report-db01 → netcup, sandy
- Component: postgresql (?)
- Problem: all connections to the feature DB were 'lost'. This broke the missing_worker.py on sandy and all the worker (v5_image_picker) nodes.
- Solution: restart of all docker containers plus the script
- Date: 2022-05-25
- Host: v3 frontends
- Component: lilly
- Problem: '2022-05-25T12:56 UWSGI CRITICAL: could not connect() to workers Connection refused'. It seemed that the connection to corpex was partly unreliable, although the VPN service was started. The containers also had a 'created' timestamp of ~10 min, i.e. they had been auto-restarted by 'mini_health_check.sh' (sg01).
- Solution: restarted docker
- Date: 2022-04-13 (detected: 2022-05-30!)
- Host: index04
- Component: redis_export
- Problem: after the exception "ResponseError: OOM command not allowed when used memory > 'maxmemory'" (File "/home/picalike/.local/lib/python2.7/site-packages/collector/service.py", line 125, in run: missing = self.proc.push([item])), no new log entries were written; the process is likely in an error state
- Solution: restarted service according to TODO.restart, after killing it
- Date: 2022-07-26
- Host: report-engine
- Component: report-engine
- Problem: http://report-engine.picalike.corpex-kunden.de:4545/osa_middleware/reports/get_osa_user_settings seems to hang. restarting the database worked last time. try to find the actual cause! – greetings from osa-prelive.picalike.corpex-kunden.de
- Solution: Not sure …
- Date: 2022-08-03
- Host: image-cloud(s)
- Component: image-cloud
- Problem: log alert 'domain for https://images.goertz.de/is/image/Goertzmedia/TOMMY-JEANS-TJM-Heritage-Guertel-schwarz~99503001~back~ADS-HB.jpg blocked using tiny proxy. please manually analyze and adapt SHIFTER_DOMAINS if needed' … actually the image is not available ('illegal image size') and the 403 should really be a 404.
- Solution: no(!) need to add it to shifter, since other images work as expected
- Date: 2022-08-15
- Host: sandy
- Component: deit features (extractor)
- Problem: The feature DB container was restarted, likely by a system update, and some services did not recover ('psycopg2.errors.AdminShutdown: terminating connection due to administrator command'). An alert is usually triggered for port 2000*.
- Solution: Check that the DB is up and running, then restart missing_worker.py and, the more painful task, restart all worker nodes
- Date: 2022-09-01
- Host: ic*
- Component: image cloud
- Problem: For the domain image1.lacoste.com we see various errors. Some images are blocked (403), some lead to bad request (400), while some can be downloaded normally (200). And even after some images were blocked, others can still be downloaded.
- Solution: Use the shifter if all requests are blocked
- Date: 2022-09-05
- Host: netcup servers
- Component: missing_worker, v5_image_picker (worker)
- Problem: the mongo connection was lost and all workers 'crashed': [sandy.picalike.corpex-kunden.de]: http://localhost:10007: health check failed with code 500
- Solution: docker restart v5_image_picker on all front-ends. Check sandy:~/image_picker/nohup.out; if there are still 500 errors, kill and restart (see TODO_restart). A sketch follows below.
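- Example (sketch): restarting the image picker on all netcup front-ends; the host list is illustrative, not complete.
  for h in v22019026221283998.hotsrv.de v22019026221283999.supersrv.de; do
    ssh picalike@"$h" 'docker restart v5_image_picker'
  done
  ssh sandy 'tail -n 20 ~/image_picker/nohup.out'       # still 500 errors? then kill and restart per TODO_restart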
- Date: 2022-09-12
- Host: dev01, sandy
- Component: v5_image_picker, color_extractor
- Problem: the color extractor relies on person detection and that service (v5_image_picker) was down, so its health check returned 'not ok' because of the 500 response. Why the 500 response? The DB was restarted and the DB connection was not re-established.
- Solution: None yet. A restart of the image picker solved the problem, but that is no real solution!
- Date: 2022-09-15
- Host: sandy
- Component: unclear / system
- Problem: the ssh port forwards 1000-10005 were somehow terminated and alerts for 20000-20005 (the socat public ports) were triggered
- Solution: restarted ssh connections ~/image_picker/start_missing_worker.sh
- Date: 2022-09-28 (2022-09-30 detected)
- Host: index04
- Component: redis_export
- Problem: redis_export.log was too old; the reason was the exception "ResponseError: OOM command not allowed when used memory > 'maxmemory'"
- Solution: kill process and restart according to TODO restart
- Date: 2022-09-29 (2022-09-30)
- Host: sg01 (?)
- Component: SCB
- Problem: all madeleine shops were hanging at the mapping service. the scheduling seemed broken, since the mapping service worked.
- Solution: set busy to false and restarted all services in the shop conveyor belt
- Date: 2022-10-11
- Host: pci
- Component: feed_import_export
- Problem: feed_import_export failed for 'zalando_de_crawler' session '4164411c17d548ff8b42e30c9a9bd1bb': <class 'asyncio.exceptions.TimeoutError'>
- Solution: checked that busy is false and restarted all stages from feed_import_export on (no QA …)
- Date: 2022-10-11
- Host: pci01
- Component: mapping_service
- Problem: mapping service [2022-10-11 15:01: INFO/USER] get_category_lookup_v3 - loaded 4846 picalike categories in 815770 ms … 14 mins instead of the usual ~100 msecs.
- Solution: None so far. Also the feed import timed-out twice
- Date: 2022-10-13
- Host: pci01
- Component: feed_import
- Problem: the mongo inserts into the history / meta DB took about 15 min per batch and the speed never increased
- Solution: if possible wait for the other import/exports to finish, then restart the container and run swiss_army_knife/v5/restart_feed_import_export.py to restart the pending shops.
- Date: 2022-10-13
- Host: pci01
- Component: feed_import, mapping_service
- Problem: the feed import sent single requests for brand/gender/category, which overloaded the mapping service. Due to time-outs, zalando constantly failed to import.
- Solution: Both components were modified to allow batching.
- Date: 2022-11-14
- Host: v22019026221283999.supersrv.de
- Component: tinyproxy
- Problem: logfile contained binary stuff
- Solution: restart
The following shops are older than 2 days in the goldmaster lilly
- Host: sg02 (goldmaster lilly)
- common offending shops, where the alert can be ignored, if it only occurs once or twice:
- 2220: test feed from sheego… the update gets triggered externally (not every day)
SCB: <shop_id> <update_id>: unable to start feed_import_export
- Date: 2022-10-25
- Host: pci01 v5-feed-import
- Problem: shop-conveyor-belt indicates that the feed-import-export stage is in “waiting” state
- Solution: restart all stages from feed-import-import again
all docker containers restarted
- Date: any
- Host: any
- Problem: all docker containers on a machine have been restarted
- Solution: unattended upgrades were performed for the docker daemon and it was restarted. Check /var/log/unattended-upgrades/unattended-upgrades-dpkg.log and search for docker.
simtwins
- Date: 2022-11-18
- Host: dev02
- Problem: [simtwins] no recent .st file, last is from 2022-11-16 13:54 - check update_simtwins.sh
- Solution: check if crontab is enabled: 0 9,16 * * * /home/picalike/v5/scripts/update_simtwins.sh
product twins import crashed
- Date: 2022-12-07
- Host: report-db01
- Problem: [report-db01.picalike.corpex-kunden.de] product twins: importing about_you_de_crawler crashed: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
- Hint: unattended upgrades of docker-related packages lead to restarts of the containers
- Solution: retrigger the import for the affected shop:
- on report-db01: check if process_shop.py is still running → kill
- restart process_shop.py in background (check crontab -l for command line)
- check the docker container that there are shops being processed in the logs
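- Example (sketch): the retrigger steps above as shell commands on report-db01; the exact start command must be copied from crontab -l and the container name is a placeholder.
  ps aux | grep [p]rocess_shop.py            # still running? kill that PID first
  crontab -l | grep process_shop.py          # copy the exact command line from here
  nohup <command line from crontab> &        # restart it in the background
  docker logs --tail 50 <twins_container>    # verify that shops are being processed again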
docker 'network' was unreachable
- Date: 2023-02-06
- Host: dev01
- Problem: no network connection from outside to inside docker was possible ('curl: (56) Recv failure: Connection reset by peer'). This seems to be a bug in docker v18.
- Solution: corpex was contacted. A restart of the daemon did not help, but a restart of the whole VM seemed to fix the problem. The VM needs a debian update to avoid further problems.
image cloud fire+forget: DB timeout
- Date: 2022-12-20
- Host: ic01
- Problem: The number of clone requests was growing and we got a slack alert. The reason was a database time-out (status code 408), visible in the docker logs. We suspect an asyncio problem.
- Solution: We restarted the docker container. This fixed the problem for the moment, but the reason why this happened is not clear.
sync services containers down
- Date: 2022-12-20
- Host: report-engine
- Problem: Due to an unintended update on the machine, the containers were taken down and did not restart, halting the SCB process pipeline.
- Solution: To restart all containers at once, run the project's CI/CD pipeline on GitLab; it redeploys the containers and lets them run again. After this, run the script restart_scb.py from the OSA project so that the services that were running/waiting are picked up again by the containers.