This application calculates clusters of similar-looking products based on the image features extracted from a given reference product. With OSA Cluster you can generate clusters based on products from trendsetter or competitor shops.
The results are stored for faster access. Recalculations happen if:
Maintainer: Björn Zessack
Calculates the similarity of a reference product to its competitor shops.
Input Params
    auto_mode: bool = False
    force_calc: bool = Schema(False, description="force recalculation of report")
    no_calc: bool = Schema(False, description="do not recalculate the report, just return what is stored in the mongo (implicitly set for slim_comp_cluster and slim_trend_cluster)")
    shop_id: str
    prod_id: Optional[str]
    picalike_id: Optional[str] = Schema(None, description="automatically set, if prod_id is provided")
    image: Optional[str]
    upload_price: Optional[int] = Schema(None, alias="price")
    limit: int = config.MAX_CLUSTER_SIZE
    sort: str = ""
    alpha: float = Schema(0.6, description="weight for trend forecast")
    beta: float = Schema(0.6, description="weight for trend forecast")
    # THIS VALUE IS ALSO A DEFAULT IN PHP, IF YOU CHANGE IT HERE, CHANGE IT THERE AS WELL!
    f2: float = Schema(0.74, alias="cluster_dist")
    report_id: Optional[str]
    cluster_date_range: Optional[str] = Schema("", description='format: "12/31/2019 - 12/31/2020"', example="12/31/2017 - 12/31/2050")
    cluster_categories: Union[List[str], str] = ""
    cluster_brands: Union[List[str], str] = ""
    cluster_attributes: Union[List[str], str] = ""
    cluster_material: Union[List[str], str] = ""
    cluster_genders: Union[List[str], str] = ""
    cluster_price_from: Union[str, None] = Schema("", description="format: 23,32")
    cluster_price_till: Union[str, None] = Schema("", description="format: 123,32")
    cluster_trendsetters: List[str] = None
    cluster_competitors: List[str] = None
    start: int = 0
    rows: Union[int, str] = Schema(config.MAX_CLUSTER_SIZE, description="special value: 'all' for all results")
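The string formats named in the field descriptions are easy to get wrong on the client side. The following is a minimal sketch of how a request body could be assembled. The helper names (`format_date_range`, `format_price`, `build_comp_cluster_payload`) are hypothetical and not part of the service, and the shop and product ids are placeholders. It assumes that aliased fields are sent under their alias (`cluster_dist` for `f2`) and that "23,32" denotes a price written with a decimal comma.

```python
import json
from datetime import date
from typing import Any, Dict, Optional


def format_date_range(start: date, end: date) -> str:
    """Render a range as 'MM/DD/YYYY - MM/DD/YYYY', the format shown for cluster_date_range."""
    return f"{start:%m/%d/%Y} - {end:%m/%d/%Y}"


def format_price(value: float) -> str:
    """Render 23.32 as '23,32' (assumes the documented format is a decimal comma)."""
    return f"{value:.2f}".replace(".", ",")


def build_comp_cluster_payload(
    shop_id: str,
    prod_id: str,
    date_range: Optional[str] = None,
    price_from: Optional[float] = None,
    price_till: Optional[float] = None,
    cluster_dist: float = 0.74,
) -> Dict[str, Any]:
    """Assemble a request body; only the filters that were given are included."""
    payload: Dict[str, Any] = {
        "shop_id": shop_id,
        "prod_id": prod_id,
        "cluster_dist": cluster_dist,  # alias of the f2 field
    }
    if date_range is not None:
        payload["cluster_date_range"] = date_range
    if price_from is not None:
        payload["cluster_price_from"] = format_price(price_from)
    if price_till is not None:
        payload["cluster_price_till"] = format_price(price_till)
    return payload


payload = build_comp_cluster_payload(
    shop_id="example_shop",  # placeholder
    prod_id="example_product",  # placeholder
    date_range=format_date_range(date(2019, 12, 31), date(2020, 12, 31)),
    price_from=23.32,
    price_till=123.32,
)
print(json.dumps(payload, indent=2))
```

Filters that are left out fall back to the defaults listed above.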
Output:
    shop_id: str
    prod_id: Optional[str]
    price_reco: PriceReco
    cluster_trend: Optional[float]
    cluster: List[ProductOut]
    cluster_stats: ClusterStats
With the given params, the API generates the cluster for a product and returns all relevant information about that cluster. If auto_mode is True:
It checks in Mongo whether a recent and valid cluster is available. If not, it calculates and stores a new cluster. See “def build_cluster” for a description of cluster_params. The calculation begins by downloading the image features from Solr or extracting them from the given picture. Using these features as the reference, the API sends a request to Solr for the products with similar features. The results are then cached in the OSA Mongo DB, in the db “reports”, collection “comp_reports” or “trend_reports”.
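The lookup-then-calculate flow described above can be sketched as follows. This is an illustration only: a plain dict stands in for the Mongo collection, and `get_or_build_cluster`, `MAX_AGE` and the `calculate` callback are hypothetical names. What counts as "recent and valid" is not specified here, so the 7-day limit is an arbitrary placeholder. The handling of `force_calc` and `no_calc` follows the descriptions in the input params.

```python
from datetime import datetime, timedelta
from typing import Callable, Dict, List, Optional

MAX_AGE = timedelta(days=7)  # placeholder; the real freshness rule is not documented here


def get_or_build_cluster(
    store: Dict[str, dict],
    report_id: str,
    calculate: Callable[[], List[str]],
    force_calc: bool = False,
    no_calc: bool = False,
    now: Optional[datetime] = None,
) -> Optional[List[str]]:
    """Return a cached cluster if it is fresh, otherwise calculate and store one."""
    now = now or datetime.now()
    cached = store.get(report_id)

    if no_calc:
        # Never recalculate: return whatever is stored, even if it is stale or missing.
        return cached["cluster"] if cached else None

    fresh = cached is not None and now - cached["created"] <= MAX_AGE
    if fresh and not force_calc:
        return cached["cluster"]

    cluster = calculate()
    store[report_id] = {"cluster": cluster, "created": now}
    return cluster
```

Passing `now` explicitly keeps the function easy to test; in normal use it can be left out.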
Returns a price recommendation based on the price information within the cluster. The recommended price is calculated as follows:
    mean_price = params["mean_price"]
    ref_price = int(params["ref_price"])

    def __calc_prices(f1, f2):
        p1 = ref_price * f1
        p2 = ref_price * f2
        p1, p2 = _adjust_prices(p1, p2)
        m1 = (p1 / ref_price - 1.0) * 100
        m2 = (p2 / ref_price - 1.0) * 100
        result["price_range"] = [p1, p2]
        result["margin_range"] = [m1, m2]

    current_diff = 1.0 - ref_price / mean_price
    if current_diff < 0.2:
        result["uncertain"] = True
    elif 0.2 < current_diff < 0.4:
        __calc_prices(1.05, 1.2)
    elif 0.4 < current_diff < 0.7:
        __calc_prices(1.15, 1.45)
    elif 0.7 < current_diff < 0.9:
        __calc_prices(1.35, 1.75)
    elif current_diff > 0.9:
        p1 = ref_price * 2
        p1, p2 = _adjust_prices(p1, p1)
        m1 = (p1 / ref_price - 1.0) * 100
        m2 = (p2 / ref_price - 1.0) * 100
        result["price_range"] = [p1, p2]
        result["margin_range"] = [m1, m2]
        result["huge_markup"] = True
    return result
Note: The function “_adjust_prices” is essentially a lookup in the OSA Mongo [“osa_db”][“realistic_prices”] that returns manually defined “realistic prices”. The constants shown above are based on empirical experience. For more details, please contact Björn or Sebastian.
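To make the tiers concrete, here is a self-contained sketch of the same logic. It is not the production code: `adjust_prices` is a hypothetical stand-in that returns its inputs unchanged, whereas the real `_adjust_prices` looks up realistic prices in Mongo, so actual results will differ. The names `price_reco` and `TIERS` are also assumptions. The sketch keeps the strict comparisons of the original, which means that a `current_diff` exactly equal to 0.2, 0.4, 0.7 or 0.9 matches no tier and yields an empty result.

```python
from typing import Dict, List, Tuple

# (lower bound, upper bound, factor for low price, factor for high price)
TIERS: List[Tuple[float, float, float, float]] = [
    (0.2, 0.4, 1.05, 1.20),
    (0.4, 0.7, 1.15, 1.45),
    (0.7, 0.9, 1.35, 1.75),
]


def adjust_prices(p1: float, p2: float) -> Tuple[float, float]:
    """Hypothetical stand-in for _adjust_prices: returns the prices unchanged."""
    return p1, p2


def price_reco(ref_price: int, mean_price: float) -> Dict[str, object]:
    """Recommend a price range from the reference price and the cluster mean price."""
    result: Dict[str, object] = {}
    current_diff = 1.0 - ref_price / mean_price

    def set_range(p1: float, p2: float) -> None:
        p1, p2 = adjust_prices(p1, p2)
        result["price_range"] = [p1, p2]
        result["margin_range"] = [
            (p1 / ref_price - 1.0) * 100,
            (p2 / ref_price - 1.0) * 100,
        ]

    if current_diff < 0.2:
        # Reference price is close to (or above) the cluster mean: no recommendation.
        result["uncertain"] = True
    elif current_diff > 0.9:
        # Reference price is far below the mean: recommend doubling it.
        set_range(ref_price * 2, ref_price * 2)
        result["huge_markup"] = True
    else:
        for low, high, f1, f2 in TIERS:
            if low < current_diff < high:
                set_range(ref_price * f1, ref_price * f2)
                break
    return result


print(price_reco(ref_price=1000, mean_price=2000))
```

With a reference price of 1000 and a cluster mean of 2000, `current_diff` is 0.5, so the second tier applies and the recommended range is roughly 1150 to 1450, a margin of about 15% to 45%.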
Input: cluster WITHOUT the reference
Output: cluster trend as float
The API uses the product trends and similarities to calculate the cluster trend. It was used as a wrapper for calc_cluster_trend_with_similarity (Björn's comment).
Rough calculation steps:
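    # largest distance in the cluster, used for normalisation
    similarity_maximum_distance = max(inp.similarity_scores)
    keep_values_ratio = 0.8
    product_scores = np.array(inp.product_scores, dtype=np.float32)
    similarity_scores = np.array(inp.similarity_scores, dtype=np.float32)

    # turn distances into similarities (1 for distance 0, 0 for the largest distance)
    similarity_scores = 1 - (similarity_scores / similarity_maximum_distance)

    # sort ascending by similarity and drop the least similar 20%
    idx = np.argsort(similarity_scores)
    s, p = similarity_scores[idx], product_scores[idx]
    border1 = int(len(s) * (1 - keep_values_ratio) + 0.5)
    p = p[border1:]

    # the cluster trend is the mean product score of the remaining products
    cluster_trend = float(np.mean(p))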
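A worked example may help to see what these steps do. The sketch below re-implements them with the standard library only (the original uses NumPy with float32, so tiny rounding differences are possible). `cluster_trend` is a hypothetical function name. It assumes that `similarity_scores` are distances where smaller means more similar, which is what the normalisation step suggests, and that the largest distance is non-zero.

```python
from statistics import mean
from typing import Sequence


def cluster_trend(
    product_scores: Sequence[float],
    similarity_scores: Sequence[float],
    keep_values_ratio: float = 0.8,
) -> float:
    """Mean product score of the most similar share of the cluster."""
    max_distance = max(similarity_scores)
    # Convert distances to similarities: the farthest product gets 0.
    similarities = [1 - d / max_distance for d in similarity_scores]
    # Indices sorted by ascending similarity (least similar first).
    order = sorted(range(len(similarities)), key=similarities.__getitem__)
    # Number of least similar products to drop, rounded to the nearest integer.
    border = int(len(order) * (1 - keep_values_ratio) + 0.5)
    kept = [product_scores[i] for i in order[border:]]
    return float(mean(kept))


print(cluster_trend([10, 20, 30, 40, 50], [0.1, 0.2, 0.3, 0.4, 0.5]))
```

With five products, one product (the one with distance 0.5 and score 50) is dropped, and the result is the mean of the remaining four scores, 25.0.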
Calculates the similarity of a reference product to its trendsetter shops.
Input and Output require the same fields as “/comp_cluster”.
The calculations are essentially the same as for “/comp_cluster”; the only difference is that the clusters are calculated from the trendsetter shops instead of the competitor shops.
Returns a slim version of comp_cluster that is used in overlap calculations. The results will not be stored in the Mongo DB. The calculation steps are the same as “/comp_cluster”.
Input:
Output:
    shop_id: str
    prod_id: str
    picalike_id: str
    picalike_cat: List[str] = Schema(..., description="picalike category of the reference article")
    brand: str = Schema(..., description="brand of the reference article")
    cluster: List[SlimClusterEntry] = Schema([], description="contains only data for competitor products. (does not contain the reference product)")
Returns a slim version of trend_cluster that is used in overlap calculations. The results will not be stored in the Mongo DB. The calculation steps are the same as “/trend_cluster”.
Input and Output require the same fields as “/slim_comp_cluster”.
Returns a list of shop_ids that have cluster_reports.
Reads the data directly from the live Mongo (db: “reports”, collection: “comp_reports”).
Returns a ClusterReport (including the price reco and cluster trend). It does not calculate any data; it only looks up cached results in the Mongo DB in the OSA DATA, in [“reports”][“price_reco”] and [“reports”][“trend_indicator”].
Input:
report_id: str
Output:
    report_id: str
    price_reco: PriceReco
    price_reco_date: datetime
    cluster_trend: Optional[float]
    cluster_trend_history: List[TrendHistoryEntry]
    cluster_trend_date: datetime
    comp_cluster_stats: ClusterStats
    trend_cluster_stats: ClusterStats
Returns a ClusterReport for a report_id. It is like “/get_cluster_report”, but values are calculated if needed. The calculation steps are the same as for “/trend_cluster”.
This API is also called by solr updater in get_cluster_info.
Input:
    report_id: str
    reference: Optional[SimProduct]
    reference_features: List[float]
    reference_sortmax: List[int]
    force_calc: bool = Schema(False, description="NotImplemented")
Output:
    report_id: str
    price_reco: PriceReco
    price_reco_date: datetime
    cluster_trend: Optional[float]
    cluster_trend_history: List[TrendHistoryEntry]
    cluster_trend_date: datetime
    comp_cluster_stats: ClusterStats
    trend_cluster_stats: ClusterStats
Returns a Solr Report containing clusters, trend score and price recommendation to be added to a Solr document.
This API is called by solr updater in get_cluster_data for products that have passed the category and gender checks and do not have a fresh cluster in the solr_reports collection. The API calculates the competitor cluster and the trend cluster (by calling similarity_api), the cluster trend score and the price recommendation. Each of these is saved in its respective collection in Mongo. The information needed for the Solr document is summarized in another report, which is saved in the solr_reports collection (it contains at most one document per product) and returned to the solr updater.
Input:
    reference: dict
    session: int
reference must contain the following fields:
    picalike_id: only reference
    id: product id
    shop_id: shop id
    img: main image url
    location: deeplink, opt
    price: price (int)
    name: opt
    brand: opt
    picalike_cat: list
    picalike_gender: list
    w: 0, opt
    features: list
    sortmax: list
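Because the reference is passed as a plain dict, a caller may want to check it before sending. The following is a sketch of such a check. `REQUIRED_REFERENCE_FIELDS` and `missing_reference_fields` are hypothetical names, the example values are placeholders, and the split into required and optional fields simply follows the "opt" markers in the field list above.

```python
from typing import Any, Dict, List

# Fields without an "opt" marker in the documented list (assumed to be required).
REQUIRED_REFERENCE_FIELDS = (
    "picalike_id",
    "id",
    "shop_id",
    "img",
    "price",
    "picalike_cat",
    "picalike_gender",
    "features",
    "sortmax",
)


def missing_reference_fields(reference: Dict[str, Any]) -> List[str]:
    """Return the required field names that are absent from the reference dict."""
    return [name for name in REQUIRED_REFERENCE_FIELDS if name not in reference]


reference = {
    "picalike_id": "example_picalike_id",  # placeholder values throughout
    "id": "example_product",
    "shop_id": "example_shop",
    "img": "https://example.com/image.jpg",
    "price": 1999,
    "picalike_cat": ["example_category"],
    "picalike_gender": ["example_gender"],
    "features": [0.1, 0.2, 0.3],
    "sortmax": [1, 2, 3],
}
print(missing_reference_fields(reference))
```

An empty list means that all fields assumed to be required are present.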
Output:
solr_report (dict)
solr_report contains:
    price_diff_int
    margin_float
    price_reco_date_dt
    cluster_trend_date_dt
    cluster_trend_float
    comp_cluster_size_int
    trend_cluster_size_int
    comp_cluster_keywords_multitext
    trend_cluster_keywords_multitext
    comp_cluster_attributes_multitext
    trend_cluster_attributes_multitext
    comp_cluster_unique_competitors_multitext
    trend_cluster_unique_competitors_multitext