Date: 2020-11-10
Host: index01-hpc
Component: v3-billing
Script: bin/distributer.py
Problem: messages could not be sent to the collection endpoints on frontend05-hpc. This created a growing backlog of unsent messages and ultimately ended in out-of-memory (OOM). This was documented in /var/log/syslog (or older files).
Solution: added a semaphore with size 50
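A minimal sketch of the idea behind the fix (illustrative only, not the actual distributer.py code), assuming an asyncio-based sender:
import asyncio

async def main():
    send_limit = asyncio.Semaphore(50)        # at most 50 sends in flight at once

    async def send_message(payload):
        async with send_limit:                # waits here instead of growing an unbounded backlog
            await asyncio.sleep(0.1)          # placeholder for the real send to the collection endpoint

    await asyncio.gather(*(send_message(i) for i in range(1000)))

asyncio.run(main())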
Date: 2020-11-10
Host: frontend05-hpc
docker containers: top_viewed_api, also_viewed_api, get_cat_trends
Problem: containers could not connect to the local mongo (top_viewed_mongo). This was the first time this problem was encountered.
Solution: placed all containers in a docker network (top_viewed_network)
Cause: unknown, may be related to docker storage engine changes (Corpex-Ticket: 6609)
Date: 2020-11-27
Host: all frontends
Problem: strange sim api results, especially different results for different frontends
Solution: call lilly sync from sg02 (check history for sync call)
Cause: unknown, maybe related to vpn problems over the last days
Date: 2020-12-01
Host: all frontends
Problem: The Look API did not yield results for some products of a look (seen first for Otto AT, but later on for many shops).
Cause: Incorrect handling of category names. Usually a category should be saved as is into the corresponding mongo collections (ic_<uid>, ic_7510, pci_styles), but for some reason special characters (Umlaute, <, >, etc.) were substituted by whitespace for some products (in some shops). While handling these categories in the Look API and inside Lilly, they were not mapped by the category ID (which is also saved in the mongo) for each product, but encoded in Lilly (crc32 function), which then searched for a wrong, non-existing ID and yielded "category not existing". It is still an open question why/how the category is saved in this wrong way at all.
Solution: We manually replaced the wrong categories by correct categories in ic_7510 and pci_styles and then changed the lilly_data_from_mongo script on sg02 (the former version is named lilly_data_from_mongo_20201202, check changes with diff command) to not use the wrong categories when exporting ic_905 (special solution for one of the affected customers) but the correct data given for each product.
Open/ToDo: the special solution has to be replaced by importing the categories correctly. This must also include the fix for all other shops with this error.
Additional Solution: for the corresponding shops, we added the entry "strip_chars": false to the mongo user collection, which should fix this issue on import (for both the ic_* collections and the pci_styles collection, not for ic_7510, which is not used any more).
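Hedged illustration of why the mangled names break the lookup, assuming the ID is the crc32 of the category name as described above (the category string is only an example):
import re, zlib

def category_id(name):
    # assumption: the ID is the crc32 of the UTF-8 encoded category name
    return zlib.crc32(name.encode("utf-8"))

original = "Kleider & Röcke"                          # example category with special characters
stripped = re.sub(r"[^A-Za-z0-9 ]", " ", original)    # roughly what the broken import did

print(category_id(original))   # ID referenced for the product
print(category_id(stripped))   # ID computed from the mangled name -> "category not existing"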
Date: 2020-12-19
Host: report-engine
Problem: slow answers from report-engine
Cause: running view refresh + lack of disk space
Solution: add more disk space
Date: 2021-01-04
Host: index04
Problem: V3 Feed-Imports did not finish since 2021-01-03
Cause: redis_export on index04 was missing.
Log: index04:/home/picalike/var/log/kpo_collector.log
2021-01-04 09:06:25,709 ERROR [kpo] {index04.picalike.corpex-kunden.de} No label exporter of type _v4-labels-jrpc._tcp.local. with service level live found!
2021-01-04 09:06:30,712 ERROR [kpo] {index04.picalike.corpex-kunden.de} failed to send label notifications, will retry in 300.0000 seconds
Solution: start redis_export (as described in TODO.restart)
Exception in thread Thread-1:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 763, in run
self.__target(*self.__args, **self.__kwargs)
File "/mnt/storage/var/live/lilly_sync_data/lib/picapika/rabbitclient.py", line 232, in background_thread
self._connection.processEvents()
File "/mnt/storage/var/live/lilly_sync_data/lib/picapika/rabbitclient.py", line 111, in processEvents
self._connection.process_data_events(0)
File "/home/picalike/.local/lib/python2.7/site-packages/pika/adapters/blocking_connection.py", line 650, in process_data_events
self._flush_output(timer.is_ready, common_terminator)
File "/home/picalike/.local/lib/python2.7/site-packages/pika/adapters/blocking_connection.py", line 411, in _flush_output
self._impl.ioloop.process_timeouts()
File "/home/picalike/.local/lib/python2.7/site-packages/pika/adapters/select_connection.py", line 283, in process_timeouts
timer['callback']()
File "/home/picalike/.local/lib/python2.7/site-packages/pika/adapters/blocking_connection.py", line 93, in signal_once
assert not self._ready, '_CallbackResult was already set'
AssertionError: _CallbackResult was already set
Date: 2021-01-18, 2021-04-14
Host: frontend-hpc02, v22019026221283998.hotsrv.de
Problem: heartbeat_monitor.py on sg01 says: LillySyncService:lilly.frontend02-hpc.picalike.corpex-kunden.de is down
Mitigation: check the lilly directory on the machine (netcup: ls --sort=time -l /home/picalike/lilly-data/ | head -n10, corpex: ls --sort=time -l /mnt/storage/var/lilly-data/ | head -n10) for recently changed directories. To check the logs, do NOT use the docker logs, but ~/log/sync_data.log. If not fresh or in doubt: docker restart frontend_instance1. Additionally, make sure that the frontend is responsive (Same-O-Same-O mail).
[2021-01-18 16:24:49 +0000] [19] [CRITICAL] WORKER TIMEOUT (pid:335)
2021-01-18 16:31:27,127 CRITICAL -- Connection close detected
Exception in thread Thread-1:
Traceback (most recent call last):
File "/usr/local/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/usr/local/lib/python2.7/threading.py", line 754, in run
self.__target(*self.__args, **self.__kwargs)
File "/mnt/storage/var/live/lilly_sync_data/lib/picapika/rabbitclient.py", line 232, in background_thread
self._connection.processEvents()
File "/mnt/storage/var/live/lilly_sync_data/lib/picapika/rabbitclient.py", line 111, in processEvents
self._connection.process_data_events(0)
File "/usr/local/lib/python2.7/site-packages/pika/adapters/blocking_connection.py", line 650, in process_data_events
self._flush_output(timer.is_ready, common_terminator)
File "/usr/local/lib/python2.7/site-packages/pika/adapters/blocking_connection.py", line 426, in _flush_output
raise exceptions.ConnectionClosed()
ConnectionClosed
#######################################################################################################
Date: 2021-01-21
Host: frontend06-hpc
Component: new-api
Problem: On frontend06-hpc an old Flask application named 'new_api' is running. From time to time the partition /mnt/storage reaches 100% usage, because new_api logs a lot and has no log rotation.
Solution:
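No solution was recorded. One possible mitigation, sketched here under the assumption that new_api uses Python's standard logging, would be a size-capped rotating handler:
import logging
from logging.handlers import RotatingFileHandler

handler = RotatingFileHandler(
    "/mnt/storage/var/log/new_api.log",   # path taken from the 2021-03-29 entry below
    maxBytes=100 * 1024 * 1024,           # keep each file below 100 MB
    backupCount=3,                        # and at most 3 rotated copies
)
logging.getLogger().addHandler(handler)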
Date: 2021-03-14
Host: pci01
Component: all on pci01
Problem: VM was unresponsive and had to be restarted (maybe problem in feed_import or memory problem in witt_reports?)
Solution: had to restart witt_reports and visualytics_notification_api. The billing service also went down as collateral (it needs pci on port 8090 to function). TODO_restart on pci01 was updated.
Date: 2021-03-18
Host: frontend05-hpc
Component: newly started docker services
Problem: The uptime of the machine was > 700 days. We deployed a new container there and started the service, but we could not connect to the public TCP port. The service itself was initialized correctly, because internally the port was reachable. Thus, we suspected a problem with docker connecting outside→inside.
Solution: The whole server was restarted; restarting dockerd alone might have sufficed, but we do not know this for sure. After the restart, the service worked as expected.
Date: 2021-03-29
Host: frontend06-hpc
Component: new_api
Problem: logging filled /mnt/storage to the max
Further Information: Widgets on madeleine.de stopped working!
Solution: stopping new_api (echo "q" > /run/shm/new_api.ctrl), deleting the log (rm /mnt/storage/var/log/new_api.log), starting new_api (uwsgi --ini /mnt/storage/var/etc/new_api.ini &)
Date: 2021-04-23, 2021-05-05
Host: dev02
Component: v5_extractor
Problem: the refresh procedure sometimes seems to be in a deadlock, which means scheduled updates do not finish. This can also be seen by sorting the feature stores by time (ls -l --sort=time) and finding no updates within the last 24h. A symptom is that push_attrs.py did not transfer any attributes for a day.
Further Information: It can be seen that the docker logs are mostly warnings and no actual processing steps. Without new features for products, no OSA queries or int scores are possible.
Solution: Stop the service. Check if there are any journal files in /home/picalike/v5/v5_backend/feeds. If so, execute sqlite3 <shop_id>.db vacuum, which applies the journal, and afterwards restart the service. Finally, refresh feed + crawler manually with the curl requests from the cronjob (dev02).
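A minimal sketch of the journal check and vacuum step, assuming rollback journals named <shop_id>.db-journal next to the .db files; run it only after stopping the service:
import glob, sqlite3

FEED_DIR = "/home/picalike/v5/v5_backend/feeds"              # path from the entry above

for journal in glob.glob(FEED_DIR + "/*.db-journal"):
    db_path = journal[:-len("-journal")]                     # e.g. <shop_id>.db
    conn = sqlite3.connect(db_path, isolation_level=None)    # autocommit, VACUUM needs no open transaction
    conn.execute("VACUUM")                                   # equivalent of: sqlite3 <shop_id>.db vacuum
    conn.close()
    print("vacuumed", db_path)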
Date: 2021-04-29
Host: dev02
Component: v5_extractor / push_attrs
Problem: push_attrs was executed while v5_extractor worked on features. The database was locked and other components did not handle it correctly.
Solution: A bugfix was deployed; as a fallback, docker restart.
Date: 2021-10-03
Host: tegraboards
Component: v4 feature extraction
Problem: some tegraboards are not responding. (Sometimes tegraboards just respond too late to the monitoring script; in this case no further steps are required.)
Solution:
check if gpu_extractor is still working (are there new entries in the log under ~/var/log?)
if the gpu_extractor is stalled, kill the current instance (ps aux | grep gpu_extractor) and get the command line to restart the service from "history | grep gpu_extractor"
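A small sketch of the "fresh log entries" check, assuming the log file names contain 'gpu_extractor' and treating anything without writes for 30 minutes as stalled (both assumptions):
import glob, os, time

STALE_AFTER = 30 * 60   # assumption: treat a log without writes for 30 minutes as stalled

for log in glob.glob(os.path.expanduser("~/var/log/*gpu_extractor*")):
    age = time.time() - os.path.getmtime(log)
    state = "stalled?" if age > STALE_AFTER else "ok"
    print(f"{log}: last write {age / 60:.0f} min ago -> {state}")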
Date: 2021-05-10
Host: tegraboards
Component: v4 feature extraction
Problem: some tegraboards are not working
Solution: (how to swap tegra board usage of feature calculation)
adjust /mnt/storage/var/etc/v3/kpo_iterware.conf
check the kpo log (less /mnt/storage/var/log/kpo_collector.log)
kill all kpo_collector processes (ps aux | grep kpo_collector, kill …)
check TODO.restart to start the kpo_collector again
check that the start worked; it can happen that the service cannot connect to the redis labels export and therefore terminates (in that case, start it again)
index03: re-enable the feed import
Date: 2021-05-20
Host: frontends/sg02
Component: look api
Problem: there was no result for a Look API query given a product that is included in a look. In the logs one saw the message "No styles found: no look fulfills the constraints" and, checking further, that the product in the style collection ic_7510 was not similar enough to itself (in the product collection).
Solution: we checked the features in the v4 image cloud and found that they are not equal for the product and the product as part of the style (product collection vs 7510 collection); then we found strange behavior in the Lilly: different products had been mapped to the same features, indicating that they have the same PHash. We confirmed this by checking the phash mongo field of both products. To resolve the issue, we then deactivated deduplication via the shop settings in the mongo, inserting "deduplication": false into the settings field (compare uid: 3530 in the user collection).
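A hedged sketch of that settings change with pymongo; the connection string and database name are assumptions, uid 3530 is the reference from above:
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # assumption: connection string
users = client["picalike"]["user"]                  # assumption: database name; 'user' collection as above

# switch off PHash-based deduplication for the affected shop (compare uid 3530)
users.update_one({"uid": 3530}, {"$set": {"settings.deduplication": False}})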
Date: 2021-05-31, 2021-09-15
Host: frontends
Component: frontend_instance1 container
Problem: heartbeat_monitor.py on sg01 says: LillySyncService:lilly.v22019026221284001.happysrv.de is down. In the logs 'CRITICAL – Connection close detected'
Solution: go to the affected frontend, execute curl "http://localhost:9000/disable", wait until queries to lilly stop (check the log in ~/log/lilly.log), then execute docker restart frontend_instance1
-
Date: 2021-06-28, 2022-12-22
Host: redis01
Component: redis (symptom is that results differ between look.php?… vs look.php?…&redis=off)
Problem: redis cache contains old data
Solution: ssh redis01, then use redis-cli -n 2 --scan --pattern '*<apikey>*' to show all entries. To actually delete them, pipe this into xargs: redis-cli -n 2 --scan --pattern '*<apikey>*' | xargs redis-cli -n 2 del
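The same cleanup as a Python sketch (redis-py), in case it should run from a script; host and DB index are taken from the entry, the api key is a placeholder:
import redis

r = redis.Redis(host="redis01", db=2)          # same DB index as in the redis-cli call
apikey = "REPLACE_ME"                          # placeholder for the affected api key

deleted = 0
for key in r.scan_iter(match=f"*{apikey}*"):   # same pattern as --scan --pattern
    deleted += r.delete(key)
print("deleted", deleted, "stale cache entries")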
Date: 2021-07-22
Host: netcup psql01
Component: v5_cat_top_trends, v5_sim [all components that are using psql01]
Problem: possible I/O problems and temporarily not reachable, which triggered the monitoring and led to stalled v5 imports
Solution: none, find better hosting provider [corpex ticket #10430]
Date: 2021-08-10
Host: psql02
Component: Host/Net
Problem: was not reachable via ping at 03:30 in the morning. But this might just be a symptom of a bigger issue.
Solution: none; it only happened once and never before, which is why I created this incident entry.
Date: 2021-08-12
Host: psql02
Component: refresh_trend_data
Problem: the database was in an inconsistent state (primary key violations)
Solution: none; it only happened once and the cause is not known
Date: 2021-08
Host: most netcup servers
Component: v5_image_picker
Problem: the DB connection to cloud01:psql was gone
Solution: restart, but unclear why it happened
Date: 2021-08-20
Host: shop conveyor belt (?)
Component: feed import
Problem: All feed shops were stalled and did not finish, which is why there was old data in the v5 psql backends. Might or might not be related to the migration of the report engine.
Solution: re-import triggered
Date: 2021-08-24
Host: pg02 (v220201062212128885.bestsrv.de)
Component: live postgres host
Problem: From 01:00-06:00 we got a lot of alerts due to high ping times to the host. At this time the daily live import was running and was maybe partly responsible for the high ping times. But since a full import is done every day (on Fridays also with all jobs), this does not explain why the pattern occurred just now.
Solution: None for now
Date: 2021-08-23 19:00
Host: cloud01
Component: various, like image-cloud
Problem: a lot of timeouts '[2021-08-23 19:02:20,248 ERROR/v5_extractor]: 140645008144128: HTTPConnectionPool(host='cloud01.picalike.corpex-kunden.de', port=5000): Read timed out. (read timeout=5)', probably due to a very high load. Cross reference: a lot of stalled shops were refreshed and might have been terminated at this time.
Solution: Fixed itself
Date: 2021-09-14 12:42
Host: multiple netcup servers (frontend)
Problem: fatal uwsgi errors related to the health check: Command '['uwsgi', '--socket', '/tmp/sim_api.sock', '--nagios']' returned non-zero exit status 2
Solution: None. restarted
Date: 2021-09-28
Host: frontend05-hpc
-
Solution: the update script /home/picalike/docker_bin/top_looks_fill_db/update_collections.py was not running any more, as one could see by checking /mnt/storage/var/log/top_looks_update.log. Started it with python3 update_collections.py &> /mnt/storage/var/log/top_looks_update.log & (compare the docker restart script), waited till it finished the updates (took some minutes) and deleted the cache (mongo collection count_cache, !!CAREFUL - DON'T DELETE ANOTHER COLLECTION!!; this is optional, otherwise wait one hour).
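Because of the warning above, a guarded drop can help; a sketch with pymongo, where the connection string and database name are assumptions:
from pymongo import MongoClient

TARGET = "count_cache"                              # the only collection that may be dropped

client = MongoClient("mongodb://localhost:27017")   # assumption: connection string
db = client["top_looks"]                            # assumption: database name

assert TARGET in db.list_collection_names(), f"{TARGET} not found, aborting"
db.drop_collection(TARGET)                          # deliberately hard-coded, never parametrized
print("dropped", TARGET)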
Date: 2021-09-28 12:42
Host: dev02
Problem: very few products for a feed shop where we expect many products
Solution: a preprocessor didn't run correctly due to RAM issues on sg01. Increased the machine's RAM size. Found the problem by looking at the postgres tables and meta_db_collection using different timestamps.
Traceback (most recent call last):
File "/app/tasks/feed_reader.py", line 42, in read_feed
failed, stats = process_feed(shop_id, feed_object, session, n_items=-1)
File "/app/feed_processor.py", line 129, in process_feed
for item_nr, data in enumerate(feed_object):
File "/app/feed_objects/FeedObject.py", line 53, in __next__
row = self.reader.__next__()
_csv.Error: field larger than field limit (131072) (from docker logs after retrying the shop)
Solution: inform customer about broken feed
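The traceback above is Python's csv module hitting its default 131072-byte field limit. A hedged sketch (not the actual FeedObject code) of how a reader can raise the limit or surface the broken feed cleanly; the delimiter is an assumption:
import csv

csv.field_size_limit(10 * 1024 * 1024)     # raise the default 131072-byte limit to 10 MB

def rows(path, delimiter=";"):             # illustrative reader, not the actual FeedObject code
    with open(path, newline="") as fh:
        try:
            for row in csv.reader(fh, delimiter=delimiter):
                yield row
        except csv.Error as exc:           # oversized or broken fields still surface here
            raise ValueError(f"broken feed {path}: {exc}") from exc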
Date: 2021-11-16
Host: sg01
Component: pci_style_updater
Problem: in alerts channel: heartbeat_monitor.py on sg01 says: style updater:sg01.picalike.corpex-kunden.de is down
Solution: check if the style_updater is still running by checking the log: "tail style_updater.log" in the home directory on sg01 (are there current entries? → if yes, everything is fine). If the style_updater is really down, the v3 feed imports should start to pile up, because they send every product to the style updater.
Date: 2021-11-24, 2021-12-14
Host: v22019026221284000 (netcup)
Problem: reboot after migration / container restart → corpex monitoring failed (connection check port 9000)
Solution: check if openvpn needs to be restarted on the server and do so (ssh as root, then (1) curl "http://dev02.picalike.corpex-kunden.de:8006/health"; if not ok, (2) kill the vpn process and use openvpn --log /root/vpn.log --daemon --config config.ovpn); restart the ssh port forward on sandy (compare /home/picalike/missing_deit_features_worker/start_missing_worker.sh); docker restart the image picker and frontend on the netcup server
Date: 2021-12-08
Host: dev02 (core problem at image-cloud)
Problem: the v5_feed_extractor logs showed that image downloads failed for hm_de_crawler and asos_de_crawler with error code 403 (forbidden). Manually checking the urls on the local machine worked, also on a netcup machine, but did not work on any corpex machine (even changing the user-agent had no effect) → probably an ip range block by 'akamai', which is the image host for both hm_de_crawler and asos_de_crawler (and also for other shops)
Solution: image-cloud needs to use a tinyproxy (but only for shops hosted by akamai)
ERROR http://pci01.picalike.corpex-kunden.de:8320 down: HTTPConnectionPool(host='pci01.picalike.corpex-kunden.de', port=8320): Max retries exceeded with url: /health (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f536f306e48>: Failed to establish a new connection: [Errno 111] Connection refused',))
Date: 2022-01-01
Host: a lot
Problem: Juggling with dates. Instead of using deltas, it was a mix of missing leading zeros ('W' instead of 'WW') and a combination of the current year with the last week, like '202252'. As a result, views contained no data or wrong data.
Solution: use proper PSQL functions instead of hard coding / manual combinations.
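For illustration, the same rules in Python terms (the actual fix used PSQL date functions): zero-pad the week and step weeks with deltas instead of decrementing the week number:
from datetime import date, timedelta

def year_week(d):
    iso = d.isocalendar()                        # ISO year + ISO week avoid the year/week mismatch
    return f"{iso[0]}{iso[1]:02d}"               # zero-padded week ('WW'), e.g. '202053'

today = date(2021, 1, 2)
print(year_week(today))                          # -> 202053 (still ISO week 53 of 2020)
print(year_week(today - timedelta(weeks=1)))     # previous week via a delta, not week_number - 1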
Date: 2022-01-13
Host: dev01 / psql
Component: osa sim
Problem: The product 5176524#withfeed_de_feed has no int score, but osa sim calls return candidates. The problem is that int scores use the new cluster scheme, while the sim calls do not. And with the clustering, some top-k results won't be returned.
Solution: none for the moment. without the new clustering queries are slow, but with them, some nearest neighbors won't be found.
Date: 2022-01-13
Host: Corpex-Frontends
Component: V3
Problem: We get a lot of requests (probably due to a newsletter from witt), many of them from 66.249.81.99 with User-Agent "Mozilla/5.0 (Windows NT 5.1; rv:11.0) Gecko Firefox/11.0 (via ggpht.com GoogleImageProxy)".
Hint: tail -f /var/log/apache2/access-frontend01.picalike.corpex-kunden.de.log, vimail-stuff is synced from frontend04-hpc to the other frontend[01-03]-hpc
Solution: None
Date: 2022-01-14
Host: dev02
Component: Mongo / Feed Import / Top Trends / psql 02 live
Problem: for at least two runs, no shop data could be read from the MongoDB. It is unclear what happened and where.
Solution: None
Date: 2022-01-19
Host: pci01
Component: PCI01 Server overall, noticed through Feed Import
Problem: the storage was full (Feed Import Alert said "feed_import_import failed for 'universal_at_feed' session '…': <class 'asyncpg.exceptions.DiskFullError'>") → the problem was the /mnt/storage/ directory, which hosts both docker and the upci service. While inserting data into the feed import postgres, the storage went full.
Hints: upci operations:
upci
Solution: docker system prune to immediately get some space, then reached out to corpex for more disk space. But the main problem was that the upci log files grew too large, probably a problem with the log rotation. Still open: limit the log size in the service and truncate the logs.
Date: 2022-01-19
Host: pci01
Component: Feed Import (export), Mapping Service
Problem: too many calls of the mapping endpoints to insert genders (also possible with brands or categories) led to a blocking of the mapping service (in this case 300k 'different' genders for one shop). If other shops then get imported, they cannot finish: they wait and at some point raise asyncio.exceptions.TimeoutError. The blocking shop at some point raises aiohttp.client_exceptions.ServerDisconnectedError.
Solution: We had to restart the mapping service to work again (~/docker_bin/mapping-service/ice_tea_run.sh). Still open: fixing the problem in the import / mapping service
Date: 2022-02-03
Host: sandy:8042
Component: Image Picker
Problem: in some cases the health check will block → timeout, log alert
Solution: none right now, but the service is still working apart from the blocking
python /mnt/storage/var/live/indexer/scripts/update/feedUpdateList.py /mnt/storage/var/etc/v3/feedUpdate.json
will return a list where the V4 field no longer decreases, e.g.:
c30d54b60ac625b74e2bb0c9ae2662b2 - 376, 0 of 5 packages pending (V4: 3381 urls to go), 01-Feb 07:22, export started: None
the number 3381 will never go down and there are more feeds queued waiting for the previous feed to finish
maybe fs is read-only → reboot as user ubuntu, restart service as user picalike (use command history)
disk full → kill process, rm log, restart service (takes a while)
only start shape extractor or color extractor except on t1of1, use /mnt/storage/var/etc/v3/kpo_iterware.conf to check which service needs to run
services on worker01:
make sure the redis_export is running
restart the kpo_collector (make sure the kpo_collector finds the redis_exporter in the log file… zeroconf)
restart all feed imports that have waiting features (cancel, retrigger update)
Date: 2021-02-24
Host: sandy / netcup
Component: v5 features
Script: image_picker/missing_worker.py
Problem: netcup services responded with 500 since the postgresql connection could not be recovered
Solution: restarted the docker containers
Date: 2022-03-11
Host: pci01
Component: v5 feed import (export) & Shop Conveyor Belt
Problem: (many) shops are hanging in the feed_import_export stage for a long time (which can be seen in the SCB)
Solution: restarted the docker containers via /home/picalike/docker_bin/feed-import/run.sh. Then, for each shop: (1) in the SCB set the busy state to false, (2) go to http://pci01.picalike.corpex-kunden.de:1337/docs and use the /delete_shop_session endpoint with the session from the SCB and delete_stats set to true, (3) restart the shop from the SCB (all stages starting with feed_import_import).
Date: 2022-05-19 (?)
Host: report-db01 → netcup, sandy
Component: postgresql (?)
Problem: all connections to the feature DB were 'lost'. This broke the missing_worker.py on sandy and all the worker (v5_image_picker) nodes.
Solution: restart of all docker containers plus the script
Date: 2022-05-25
Host: v3 frontends
Component: lilly
Problem: '2022-05-25T12:56 UWSGI CRITICAL: could not connect() to workers Connection refused'. It seemed that the connection to corpex was partly unreliable, but the VPN service was started. The containers also had a 'created' timestamp of ~10 min: auto restart by 'mini_health_check.sh' (sg01).
Solution: restarted docker
Date: 2022-04-13 (detected: 2022-05-30!)
Host: index04
Component: redis_export
Problem: after an exception "ResponseError: OOM command not allowed when used memory > 'maxmemory'" (File "/home/picalike/.local/lib/python2.7/site-packages/collector/service.py", line 125, in run: missing = self.proc.push([item])), no new log entries were written. The process is likely in an error state.
Solution: restarted service according to TODO.restart, after killing it
Date: 2022-07-26
Host: report-engine
Component: report-engine
-
Solution: Not sure …
Date: 2022-08-15
Host: sandy
Component: deit features (extractor)
Problem: The feature DB container was restarted, likely by a system update, and some services did not recover: 'psycopg2.errors.AdminShutdown: terminating connection due to administrator command'. An alert is usually triggered for port 2000*.
Solution: Check that the DB is up and running, then restart missing_worker.py and, the more painful task, restart all worker nodes.
Date: 2022-09-01
Host: ic*
Component: image cloud
Problem: For the domain image1.lacoste.com we see various errors. Some images are blocked (403), some lead to bad request (400), while some can be downloaded normally (200). And even after some images were blocked, others could still be downloaded.
Solution: use shifter if all requests are blocked
Date: 2022-09-05
Host: netcup servers
Component: missing_worker, v5_image_picker (worker)
Problem: the mongo connection was lost and all workers 'crashed': [sandy.picalike.corpex-kunden.de]: http://localhost:10007: health check failed with code 500
Solution: docker restart v5_image_picker on all frontends. Check sandy:~/image_picker/nohup.out; if there are still 500 errors, kill and restart (see TODO_restart)
Date: 2022-09-12
Host: dev01, sandy
Component: v5_image_picker, color_extractor
Problem: the color extractor relies on person detection, and that service (v5_image_picker) was down, so the health check generated 'not ok' because of the 500 response. Why the 500 response: the DB was restarted and the DB connection was not re-established.
Solution: None yet. A restart of the image picker solved the problem, but is no solution!
Date: 2022-09-15
Host: sandy
Component: unclear / system
Problem: the ssh port forwards 1000-10005 were somehow terminated and alerts for 20000-20005 (socat public ports) were triggered
Solution: restarted the ssh connections via ~/image_picker/start_missing_worker.sh
Date: 2022-09-28 (2022-09-30 detected)
Host: index04
Component: redis_export
Problem: redis_export.log was too old; the reason was an exception "ResponseError: OOM command not allowed when used memory > 'maxmemory'"
Solution: kill process and restart according to TODO restart
Date: 2022-09-29 (2022-09-30)
Host: sg01 (?)
Component: SCB
Problem: all madeleine shops were hanging at the mapping service. The scheduling seemed broken, since the mapping service itself worked.
Solution: set busy to false and restarted all services in the shop conveyor belt
Date: 2022-10-11
Host: pci
Component: feed_import_export
Problem: feed_import_export failed for 'zalando_de_crawler' session '4164411c17d548ff8b42e30c9a9bd1bb': <class 'asyncio.exceptions.TimeoutError'>
Solution: check that busy is false and restarted all services from feed import export onwards (no QA …)
Date: 2022-10-11
Host: pci01
Component: mapping_service
Problem: mapping service log: [2022-10-11 15:01: INFO/USER] get_category_lookup_v3 - loaded 4846 picalike categories in 815770 ms, i.e. ~14 min instead of the usual ~100 ms.
Solution: None so far. The feed import also timed out twice.
Date: 2022-10-13
Host: pci01
Component: feed_import
Problem: the mongo insert into the history / metadb took about 15 min per batch and the speed never increased
Solution: if possible wait for the other import/exports to finish, then restart the container and run swiss_army_knife/v5/restart_feed_import_export.py to restart the pending shops.
Date: 2022-10-13
Host: pci01
Component: feed_import, mapping_service
Problem: the feed import sent single requests per brand/gender/category, which overloaded the mapping service. Due to time-outs, zalando constantly failed to import.
Solution: Both components were modified to allow batching.
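A hedged sketch of the batching idea; the endpoint, port and payload shape are assumptions, not the real mapping-service API:
import requests

MAPPING_URL = "http://pci01.picalike.corpex-kunden.de:9999/map_genders"   # hypothetical endpoint and port

def chunks(items, size=500):
    for i in range(0, len(items), size):
        yield items[i:i + size]

def map_genders(genders):
    mapping = {}
    for batch in chunks(sorted(set(genders))):      # deduplicate first, then send bounded batches
        resp = requests.post(MAPPING_URL, json={"values": batch}, timeout=30)
        resp.raise_for_status()
        mapping.update(resp.json())                 # hypothetical response shape: {raw: mapped}
    # one request per 500 values instead of one request per product keeps the service responsive
    return mapping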