====== Automatic Picalike Category Prediction ======

As an addition to the (keyword-based) mapping service, we worked on a system that automatically derives the picalike category from product names and shop categories.

git: https://git.picalike.corpex-kunden.de/incubator/category-predictor

===== Overview =====

The goal is to predict a picalike_category from the shop categories of a crawler shop (not feed shops!). The category predictor takes the raw shop categories from the Krawla, together with the product names, as input.

====== Remarks and current status ======

Different kinds of models were used. All of them fuse the product name and the category tree (from the website) to predict the category.

One approach was to predict the category sequentially, that is, predicting each category level in turn with a recurrent neural net (RNN) (e.g. clothing → clothing//pants → clothing//pants%%__%%jeans). The other approach was to use an energy-based model, presenting all categories and selecting those with the lowest energy. Overall, the difference between the two approaches is not very big.

This project was abandoned due to a few issues:

  - The data used for training comes from the mapping service. At the time, the categories were quite noisy. Since the model is only as good as its training data, we couldn't improve over the keyword-only mapping service.
  - Later, many heuristics were incorporated into the mapping service, which significantly improved the quality of the category labels. Still, since the models receive as input the same strings that the mapping service uses to perform the mapping, it's very likely that no extra information is learned (besides keywords).
  - This was part of an effort to improve OSA data quality. With OSA not being actively developed, it doesn't make sense to continue (for now).

There is a minimal API implementation, but it's far from optimal as development was halted.
Refactoring or rewriting based on new requirements is strongly advised if this project is re-opened.
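
For anyone picking this up again, the two decoding strategies described under "Remarks and current status" can be sketched roughly as below. This is only an illustration and not the code in the repository: the taxonomy ''CATEGORIES'', the helper names and the token-overlap energy are all hypothetical stand-ins for the learned RNN and energy-based models, chosen so that the example runs without any trained weights.

```python
import re
from typing import List, Sequence, Set, Tuple

# One entry per category level. Hypothetical toy taxonomy, NOT the real
# picalike category list.
Category = Tuple[str, ...]
CATEGORIES: List[Category] = [
    ("clothing", "pants", "jeans"),
    ("clothing", "pants", "shorts"),
    ("clothing", "tops", "shirts"),
    ("shoes", "sneakers"),
]


def fuse_input(product_name: str, shop_categories: Sequence[str]) -> Set[str]:
    """Fuse product name and shop category tree into one bag of tokens."""
    text = product_name + " " + " ".join(shop_categories)
    return set(re.findall(r"[a-z]+", text.lower()))


def level_energy(tokens: Set[str], level: str) -> float:
    """Stand-in for a learned score: lower energy means a better match."""
    return -1.0 if level in tokens else 0.0


def category_energy(tokens: Set[str], category: Category) -> float:
    """Energy of a full category: sum of its per-level energies."""
    return sum(level_energy(tokens, level) for level in category)


def predict_energy_based(tokens: Set[str]) -> Category:
    """Present all categories and select the one with the lowest energy."""
    return min(CATEGORIES, key=lambda cat: category_energy(tokens, cat))


def children(prefix: Category) -> List[str]:
    """Level names that can follow the given category prefix."""
    n = len(prefix)
    return sorted({c[n] for c in CATEGORIES if len(c) > n and c[:n] == prefix})


def predict_sequential(tokens: Set[str]) -> Category:
    """Predict one category level at a time, greedily, until a leaf is reached."""
    prefix: Category = ()
    while True:
        options = children(prefix)
        if not options:
            return prefix
        best = min(options, key=lambda level: level_energy(tokens, level))
        prefix = prefix + (best,)


if __name__ == "__main__":
    tokens = fuse_input("Slim Fit Jeans blue", ["Clothing", "Pants"])
    print(predict_energy_based(tokens))
    print(predict_sequential(tokens))
```

Note that the toy energy only matches keywords, which is exactly the limitation described in the list of issues above: a learned model fed with the same strings as the mapping service has little beyond keywords to pick up.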