====== OSA Cluster ======
This application calculates clusters of similar-looking products based on the image features extracted from a given reference product. With OSA Cluster you can generate clusters based on products from trendsetter or competitor shops. The results are stored for faster access. Recalculations happen if:
auto_mode: bool = False
force_calc: bool = Schema(False, description="force recalculation of report")
no_calc: bool = Schema(False, description="do not recalculate the report, just return what is stored in the mongo (implicitly set for slim_comp_cluster and slim_trend_cluster)")
shop_id: str
prod_id: Optional[str]
picalike_id: Optional[str] = Schema(None, description="automatically set, if prod_id is provided")
image: Optional[str]
upload_price: Optional[int] = Schema(None, alias="price")
limit: int = config.MAX_CLUSTER_SIZE
sort: str = ""
alpha: float = Schema(0.6, description="weight for trend forecast")
beta: float = Schema(0.6, description="weight for trend forecast")
# THIS VALUE IS ALSO A DEFAULT IN PHP, IF YOU CHANGE IT HERE, CHANGE IT THERE AS WELL!
f2: float = Schema(0.74, alias="cluster_dist")
report_id: Optional[str]
cluster_date_range: Optional[str] = Schema("", description='format: "12/31/2019 - 12/31/2020"', example="12/31/2017 - 12/31/2050")
cluster_categories: Union[List[str], str] = ""
cluster_brands: Union[List[str], str] = ""
cluster_attributes: Union[List[str], str] = ""
cluster_material: Union[List[str], str] = ""
cluster_genders: Union[List[str], str] = ""
cluster_price_from: Union[str, None] = Schema("", description="format: 23,32")
cluster_price_till: Union[str, None] = Schema("", description="format: 123,32")
cluster_trendsetters: List[str] = None
cluster_competitors: List[str] = None
start: int = 0
rows: Union[int, str] = Schema(config.MAX_CLUSTER_SIZE, description="special value: 'all' for all results")
**Output:**
shop_id: str
prod_id: Optional[str]
price_reco: PriceReco
cluster_trend: Optional[float]
cluster: List[ProductOut]
cluster_stats: ClusterStats
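As an illustration of the input fields listed above, the following sketch assembles a request body. The helper names (`format_date_range`, `format_price`, `build_cluster_request`) and the example shop and product ids are hypothetical. It also assumes that the price filters use a decimal comma with two decimals, as the "format: 23,32" hint suggests, and that aliased fields (`cluster_dist` for `f2`) are sent under their alias.

```python
from datetime import date


def format_date_range(start: date, end: date) -> str:
    """Render the 'MM/DD/YYYY - MM/DD/YYYY' form shown for cluster_date_range."""
    return f"{start:%m/%d/%Y} - {end:%m/%d/%Y}"


def format_price(value: float) -> str:
    """Render a price filter with a decimal comma (assumed), e.g. 23.32 -> '23,32'."""
    return f"{value:.2f}".replace(".", ",")


def build_cluster_request(shop_id: str, prod_id: str) -> dict:
    """Hypothetical helper: assemble a request body from the documented input fields."""
    return {
        "shop_id": shop_id,
        "prod_id": prod_id,
        "cluster_dist": 0.74,  # alias of f2, documented default
        "cluster_date_range": format_date_range(date(2019, 12, 31), date(2020, 12, 31)),
        "cluster_price_from": format_price(23.32),
        "cluster_price_till": format_price(123.32),
        "start": 0,
        "rows": "all",  # special value: return all results
    }


payload = build_cluster_request("example_shop", "12345")
```

The body is only built here, not sent, because the host and transport details are not part of this page.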
=== How does the calculation work? ===
With the given parameters, the API generates the cluster for a product and returns all relevant information for that cluster. If auto_mode == True:
* get price reco and cluster from cache, calc and store if needed
* get trend indicator from cache, calc and store if needed
== Building the cluster ==
The API first checks in Mongo whether a recent and valid cluster is available. If not, it calculates and stores a new cluster. See “def build_cluster” for a description of cluster_params. The calculation begins by downloading the image features from Solr or by extracting them from the given picture. Using these features as the reference, the API builds a request for the products with similar features in the Solr database. The results are then cached in the OSA Mongo DB in the collection “reports” –> “comp_reports” / “trend_reports”.
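The "use the stored cluster if it is recent, otherwise calculate and store" flow can be sketched as follows. This is a minimal sketch under assumptions: a plain dict stands in for the Mongo collection, `MAX_AGE` is an invented freshness window, and `get_or_build_cluster` is a hypothetical name. Only the `force_calc` and `no_calc` flags come from the input fields on this page.

```python
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(days=7)  # assumed freshness window, not taken from the source


def get_or_build_cluster(store, key, build, now=None, force_calc=False, no_calc=False):
    """Return a stored cluster if it is fresh, otherwise build and store a new one.

    `store` is a dict standing in for the Mongo collection (assumption).
    `build` is a zero-argument callable that calculates the cluster.
    """
    now = now or datetime.now(timezone.utc)
    entry = store.get(key)
    if no_calc:
        # never recalculate, just return what is stored (may be None)
        return entry["cluster"] if entry else None
    fresh = entry is not None and now - entry["created"] <= MAX_AGE
    if fresh and not force_calc:
        return entry["cluster"]
    cluster = build()
    store[key] = {"cluster": cluster, "created": now}
    return cluster
```

With this shape, `force_calc=True` always rebuilds, while `no_calc=True` never does, even when nothing is stored yet.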
== Calculating the price recommendation ("calc_price_reco") ==
Returns a recommendation based on some price information within the cluster. This is how the reco price is calculated:
  mean_price = params["mean_price"]
  ref_price = int(params["ref_price"])
  def __calc_prices(f1, f2):
      # scale the reference price by both factors, then snap to realistic prices
      p1 = ref_price * f1
      p2 = ref_price * f2
      p1, p2 = _adjust_prices(p1, p2)
      # margins in percent, relative to the reference price
      m1 = (p1 / ref_price - 1.0) * 100
      m2 = (p2 / ref_price - 1.0) * 100
      result["price_range"] = [p1, p2]
      result["margin_range"] = [m1, m2]
  # relative gap between the reference price and the mean price of the cluster
  current_diff = 1.0 - ref_price / mean_price
  # note: as written, a value exactly on a boundary (0.2, 0.4, 0.7, 0.9) matches no branch
  if current_diff < 0.2:
      result["uncertain"] = True
  elif 0.2 < current_diff < 0.4:
      __calc_prices(1.05, 1.2)
  elif 0.4 < current_diff < 0.7:
      __calc_prices(1.15, 1.45)
  elif 0.7 < current_diff < 0.9:
      __calc_prices(1.35, 1.75)
  elif current_diff > 0.9:
      p1 = ref_price * 2
      p1, p2 = _adjust_prices(p1, p1)
      m1 = (p1 / ref_price - 1.0) * 100
      m2 = (p2 / ref_price - 1.0) * 100
      result["price_range"] = [p1, p2]
      result["margin_range"] = [m1, m2]
      result["huge_markup"] = True
  return result
Note: The function “_adjust_prices” is essentially a lookup in the OSA Mongo [“osa_db”][“realistic_prices”] that returns manually defined “realistic prices”. The constants above are based on empirical experience. For questions about them, please contact Björn or Sebastian.
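To make the thresholds easier to follow, here is a self-contained sketch of the same branching with a worked example. It is not the production function: `calc_price_reco_sketch` is a hypothetical name, and the Mongo lookup behind “_adjust_prices” is replaced by a pass-through stub, so the prices are not snapped to realistic values.

```python
def calc_price_reco_sketch(ref_price, mean_price, adjust_prices=lambda p1, p2: (p1, p2)):
    """Mirror the branching of calc_price_reco; adjust_prices defaults to a no-op stub."""
    result = {}

    def set_range(f1, f2):
        # scale the reference price, then derive the margins in percent
        p1, p2 = adjust_prices(ref_price * f1, ref_price * f2)
        result["price_range"] = [p1, p2]
        result["margin_range"] = [
            (p1 / ref_price - 1.0) * 100,
            (p2 / ref_price - 1.0) * 100,
        ]

    current_diff = 1.0 - ref_price / mean_price
    if current_diff < 0.2:
        result["uncertain"] = True
    elif 0.2 < current_diff < 0.4:
        set_range(1.05, 1.2)
    elif 0.4 < current_diff < 0.7:
        set_range(1.15, 1.45)
    elif 0.7 < current_diff < 0.9:
        set_range(1.35, 1.75)
    elif current_diff > 0.9:
        set_range(2.0, 2.0)  # same price for both ends of the range
        result["huge_markup"] = True
    return result


# reference price 50 against a cluster mean of 100: current_diff is 0.5
mid_case = calc_price_reco_sketch(50, 100)
# reference price 90 against a cluster mean of 100: gap too small to recommend
uncertain_case = calc_price_reco_sketch(90, 100)
```

In the first case the reference price is 50% below the cluster mean, so the factors 1.15 and 1.45 apply and the recommended range is roughly 57.5 to 72.5. In the second case the gap is only about 10%, so the result is flagged as uncertain.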
== Calculating the trend score ==
It will check in mongo if a recent and valid cluster is available. If not, it calcs and stores a new cluster. See build_cluster for description of cluster_params.
Input: cluster WITHOUT the reference
Output: cluster trend as float
The API uses the product trends and similarities to calculate the cluster trend. It was used as a wrapper for calc_cluster_trend_with_similarity (Björn's comment).
Calculation steps, roughly:
* get product scores and similarity scores
similarity_maximum_distance = max(inp.similarity_scores)
keep_values_ratio = 0.8
product_scores = np.array(inp.product_scores, dtype=np.float32)
similarity_scores = np.array(inp.similarity_scores, dtype=np.float32)
* rescale similarity scores
similarity_scores = 1 - (similarity_scores / similarity_maximum_distance)
* keep only the 80% most similar products as relevant
idx = np.argsort(similarity_scores)
s, p = similarity_scores[idx], product_scores[idx]
border1 = int(len(s) * (1 - keep_values_ratio) + 0.5)
p = p[border1:]
* average the remaining product scores to get cluster_trend
cluster_trend = float(np.mean(p))
* return cluster_trend float score
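The steps above can be put together into one small function with a worked example. This sketch deliberately swaps numpy for the standard library so that it has no dependencies. The name `cluster_trend_sketch`, the `None` result for an empty cluster and the handling of an all-zero distance list are assumptions, not behaviour confirmed by the source.

```python
from statistics import fmean


def cluster_trend_sketch(product_scores, similarity_scores, keep_values_ratio=0.8):
    """Average the product scores of the most similar products in a cluster."""
    if not product_scores:
        return None  # assumption: no cluster, no trend
    max_distance = max(similarity_scores)
    if max_distance == 0:
        # assumption: all products are identical to the reference
        similarities = [1.0] * len(similarity_scores)
    else:
        # rescale distances so that a higher value means more similar
        similarities = [1 - d / max_distance for d in similarity_scores]
    # indices sorted from least to most similar
    order = sorted(range(len(similarities)), key=similarities.__getitem__)
    # drop the least similar (1 - keep_values_ratio) share, rounded to nearest
    border = int(len(order) * (1 - keep_values_ratio) + 0.5)
    kept = [product_scores[i] for i in order[border:]]
    return fmean(kept)


trend = cluster_trend_sketch(
    product_scores=[0.9, 0.1, 0.5, 0.7, 0.3],
    similarity_scores=[0.1, 0.5, 0.2, 0.3, 0.4],
)
```

In this example the product with the largest distance (0.5, product score 0.1) is dropped, and the remaining four product scores are averaged, giving approximately 0.6.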
==== "/trend_cluster" [POST] ====
Calculates the similarity of a reference product to its trendsetter shops.
Input and Output require the same fields as “/comp_cluster”.
The calculations are essentially the same as for “/comp_cluster”; the only difference is that the clusters are calculated from the trendsetter shops of the reference product instead of its competitors.
==== "/slim_comp_cluster" [GET] ====
Returns a slim version of comp_cluster that is used in overlap calculations. The results will not be stored in the Mongo DB. The calculation steps are the same as “/comp_cluster”.
Input:
* shop_id
* prod_id
Output:
shop_id: str
prod_id: str
picalike_id: str
picalike_cat: List[str] = Schema(
...,
description="picalike category of the reference article"
)
brand: str = Schema(
...,
description="brand of the reference article"
)
cluster: List[SlimClusterEntry] = Schema(
[],
description="contains only data for competitor products. (does not contain the reference product)"
)
==== "/slim_trend_cluster" [GET] ====
Returns a slim version of trend_cluster that is used in overlap calculations. The results will not be stored in the Mongo DB. The calculation steps are the same as “/trend_cluster”.
Input and Output require the same fields as “/slim_comp_cluster”.
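Since both slim endpoints are GET requests that take only shop_id and prod_id, a request URL could be assembled as in the sketch below. The host in `BASE_URL` is a placeholder, `slim_cluster_url` is a hypothetical helper, and passing the inputs as query parameters is an assumption about how the GET endpoints receive them.

```python
from urllib.parse import urlencode

BASE_URL = "http://osa-cluster.example"  # hypothetical host, replace with the real one


def slim_cluster_url(endpoint: str, shop_id: str, prod_id: str) -> str:
    """Build the URL for /slim_comp_cluster or /slim_trend_cluster (assumed query params)."""
    query = urlencode({"shop_id": shop_id, "prod_id": prod_id})
    return f"{BASE_URL}/{endpoint}?{query}"


comp_url = slim_cluster_url("slim_comp_cluster", "example_shop", "12345")
trend_url = slim_cluster_url("slim_trend_cluster", "example_shop", "12345")
```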
==== "/shop_ids" [GET] ====
Returns a list of shop_ids that have cluster reports.
The data is downloaded directly from the live Mongo –> db: “reports” - collection: “comp_reports”.
==== "/get_cluster_report" [POST] ====
Returns a ClusterReport (including the price reco and cluster trend). It does not calculate any data; it only looks for cached results in the Mongo DB in the OSA DATA –> [“reports”][“price_reco”] & [“reports”][“trend_indicator”].
Input:
report_id: str
Output:
report_id: str
price_reco: PriceReco
price_reco_date: datetime
cluster_trend: Optional[float]
cluster_trend_history: List[TrendHistoryEntry]
cluster_trend_date: datetime
comp_cluster_stats: ClusterStats
trend_cluster_stats: ClusterStats
==== "/calc_report" [POST] ====
Returns a ClusterReport for a report_id. It is like “/get_cluster_report”, but values are calculated if needed. The calculation steps are the same as for “/trend_cluster”.
This API is also called by the Solr updater in get_cluster_info.
Input:
report_id: str
reference: Optional[SimProduct]
reference_features: List[float]
reference_sortmax: List[int]
force_calc: bool = Schema(
False,
description="NotImplemented"
)
Output:
report_id: str
price_reco: PriceReco
price_reco_date: datetime
cluster_trend: Optional[float]
cluster_trend_history: List[TrendHistoryEntry]
cluster_trend_date: datetime
comp_cluster_stats: ClusterStats
trend_cluster_stats: ClusterStats
==== "/calc_single_solr" [POST] ====
Returns a Solr Report containing clusters, trend score and price recommendation to be added to a Solr document.
This API is called by the Solr updater in get_cluster_data for products that have passed the category and gender checks and do not have a fresh cluster in the solr_reports collection. The API calculates the competitor cluster and the trend cluster (by calling similarity_api), the cluster trend score and the price recommendation. Each of these is saved in its respective collection in Mongo. The information needed for the Solr document is summarized in another report, which is saved in the solr_reports collection (at most one document per product) and returned to the Solr updater.
Input:
reference: dict
session: int
reference must contain the following fields:
picalike_id: only reference
id: product id
shop_id: shop id
img: main image url
location: deeplink, opt
price: price (int)
name: opt
brand: opt
picalike_cat: list
picalike_gender: list
w: 0, opt
features: list
sortmax: list
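Before calling the endpoint, a caller could check the reference dict against the field list above. The sketch below is illustrative only: `check_reference` is a hypothetical helper, and it assumes that every field not marked "opt" is required, including picalike_id.

```python
# fields not marked "opt" in the list above (assumed to be required)
REQUIRED_FIELDS = (
    "picalike_id", "id", "shop_id", "img", "price",
    "picalike_cat", "picalike_gender", "features", "sortmax",
)
LIST_FIELDS = ("picalike_cat", "picalike_gender", "features", "sortmax")


def check_reference(reference: dict) -> list:
    """Return a list of human-readable problems; an empty list means the dict looks fine."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in reference]
    if "price" in reference and not isinstance(reference["price"], int):
        problems.append("price must be an int")
    for name in LIST_FIELDS:
        if name in reference and not isinstance(reference[name], list):
            problems.append(f"{name} must be a list")
    return problems


example_reference = {
    "picalike_id": "abc", "id": "12345", "shop_id": "example_shop",
    "img": "http://shop.example/12345.jpg", "price": 1999,
    "picalike_cat": ["dress"], "picalike_gender": ["female"],
    "features": [0.1, 0.2], "sortmax": [3, 1],
}
problems = check_reference(example_reference)
```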
Output:
solr_report (dict)
solr_report contains:
price_diff_int
margin_float
price_reco_date_dt
cluster_trend_date_dt
cluster_trend_float
comp_cluster_size_int
trend_cluster_size_int
comp_cluster_keywords_multitext
trend_cluster_keywords_multitext
comp_cluster_attributes_multitext
trend_cluster_attributes_multitext
comp_cluster_unique_competitors_multitext
trend_cluster_unique_competitors_multitext
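The field names end in suffixes (_int, _float, _dt, _multitext) that look like type hints for the Solr document. Assuming that reading is correct, a report could be sanity-checked as sketched below. The suffix-to-type mapping and the helper name `mistyped_fields` are assumptions, not something this page confirms.

```python
from datetime import datetime

# assumed meaning of the field-name suffixes
SUFFIX_TYPES = {
    "_int": int,
    "_float": (int, float),
    "_dt": (datetime, str),  # a datetime object or an already serialized date string
    "_multitext": list,
}


def mistyped_fields(solr_report: dict) -> list:
    """Return the names of fields whose value does not match the type implied by the suffix."""
    bad = []
    for name, value in solr_report.items():
        for suffix, expected in SUFFIX_TYPES.items():
            if name.endswith(suffix) and not isinstance(value, expected):
                bad.append(name)
    return bad


example_report = {
    "price_diff_int": 12,
    "margin_float": 15.0,
    "cluster_trend_float": 0.6,
    "comp_cluster_size_int": 40,
    "comp_cluster_keywords_multitext": ["dress", "summer"],
}
bad_fields = mistyped_fields(example_report)
```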