====== OSA Cluster ======

This application calculates clusters of similar-looking products based on the image features extracted from a given reference product. With OSA Cluster you can generate clusters from the products of trendsetter or competitor shops.

The results are stored for faster access. A recalculation happens if:

  * the stored report is older than 7 days
  * the price of the reference product changed
  * the items in the excluded list changed
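The three conditions above can be sketched as a single predicate. This is only an illustration: the function name and the report fields ''created'', ''ref_price'' and ''excluded'' are hypothetical and are not taken from the actual code base.

```python
from datetime import datetime, timedelta

# Maximum age of a stored report, taken from the list above.
MAX_REPORT_AGE = timedelta(days=7)


def needs_recalculation(stored, ref_price, excluded, now):
    """Return True if a stored report should be recalculated.

    `stored` is a dict with the hypothetical keys "created", "ref_price"
    and "excluded", or None if no report has been stored yet.
    """
    if stored is None:
        return True  # nothing cached yet
    if now - stored["created"] > MAX_REPORT_AGE:
        return True  # stored report is older than 7 days
    if stored["ref_price"] != ref_price:
        return True  # price of the reference product changed
    if set(stored["excluded"]) != set(excluded):
        return True  # excluded list changed
    return False
```

Passing ''now'' explicitly (instead of calling ''datetime.now()'' inside the function) keeps the check easy to test.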

Maintainer: Björn Zessack


===== How to access the API? =====

Host: http://pci01.picalike.corpex-kunden.de:8009


===== Dependencies =====


==== Solr ====


==== OSA Mongo ====


=== Collections: ===


  * reports - trend_indicator
  * reports - trend_reports
  * reports - price_reco
  * reports - comp_reports
  * reports - solr_reports


==== OSA Mongo Data ====


===== Endpoints =====


==== "/comp_cluster" [POST] ====

Calculates a cluster of products from competitor shops that are similar to a reference product.

**Input Params**

<code>
  auto_mode: bool = False
  force_calc: bool = Schema(False, description="force recalculation of report")
  no_calc: bool = Schema(False, description="do not recalculate the report, just return what is stored in the mongo (implicitly set for slim_comp_cluster and slim_trend_cluster)")
  shop_id: str
  prod_id: Optional[str]
  picalike_id: Optional[str] = Schema(None, description="automatically set, if prod_id is provided")
  image: Optional[str]
  upload_price: Optional[int] = Schema(None, alias="price")
  limit: int = config.MAX_CLUSTER_SIZE
  sort: str = ""
  alpha: float = Schema(0.6, description="weight for trend forecast")
  beta: float = Schema(0.6, description="weight for trend forecast")
  # THIS VALUE IS ALSO A DEFAULT IN PHP, IF YOU CHANGE IT HERE, CHANGE IT THERE AS WELL!
  f2: float = Schema(0.74, alias="cluster_dist")
  report_id: Optional[str]
  cluster_date_range: Optional[str] = Schema("", description='format: "12/31/2019 - 12/31/2020"', example="12/31/2017 - 12/31/2050")
  cluster_categories: Union[List[str], str] = ""
  cluster_brands: Union[List[str], str] = ""
  cluster_attributes: Union[List[str], str] = ""
  cluster_material: Union[List[str], str] = ""
  cluster_genders: Union[List[str], str] = ""
  cluster_price_from: Union[str, None] = Schema("", description="format: 23,32")
  cluster_price_till: Union[str, None] = Schema("", description="format: 123,32")
  cluster_trendsetters: List[str] = None
  cluster_competitors: List[str] = None
  start: int = 0
  rows: Union[int, str] = Schema(config.MAX_CLUSTER_SIZE, description="special value: 'all' for all results")
</code>
**Output:**

<code>
  shop_id: str
  prod_id: Optional[str]
  price_reco: PriceReco
  cluster_trend: Optional[float]
  cluster: List[ProductOut]
  cluster_stats: ClusterStats
</code>
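A minimal sketch of how a request to this endpoint could be built with the Python standard library. It assumes that the endpoint accepts a JSON body and that the aliases ''price'' and ''cluster_dist'' are the names expected on input; the shop and product ids are made-up placeholders. The request is only constructed here, not sent.

```python
import json
import urllib.request

# Host as given in the "How to access" section above.
HOST = "http://pci01.picalike.corpex-kunden.de:8009"


def build_comp_cluster_request(shop_id, prod_id, price=None, cluster_dist=0.74):
    """Build (but do not send) a POST request for /comp_cluster."""
    payload = {
        "shop_id": shop_id,
        "prod_id": prod_id,
        "cluster_dist": cluster_dist,  # alias of f2 (assumed input name)
    }
    if price is not None:
        payload["price"] = price  # alias of upload_price (assumed input name)
    return urllib.request.Request(
        HOST + "/comp_cluster",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder ids, for illustration only.
request = build_comp_cluster_request("example_shop", "12345", price=1999)
# Sending it would be: urllib.request.urlopen(request)
```

All parameters not listed in the payload fall back to the defaults shown in the input params.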

=== How does the calculation work? ===

With the given params, the API generates the cluster for a product and returns all relevant information for that cluster. If auto_mode is True:


  * get price reco and cluster from cache, calc and store if needed
  * get trend indicator from cache, calc and store if needed


== Building the cluster ==

The API checks in Mongo whether a recent and valid cluster is available. If not, it calculates and stores a new cluster. See “def build_cluster” for a description of cluster_params. The calculation begins by downloading the image features from Solr or extracting them from the given picture. Using these features as the reference, the API sends a request to Solr asking for products with similar features. The results are then cached in the OSA Mongo DB, in the db “reports”, collection “comp_reports” or “trend_reports”.
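The flow described above (look in the cache, otherwise fetch the features, query for similar products and store the result) can be sketched as follows. All names are hypothetical; the real implementation talks to Mongo and Solr, which are replaced here by a plain dict and injected callables.

```python
def get_or_build_cluster(cache, key, is_valid, fetch_features, query_similar):
    """Return a cached cluster if it is still valid, else build and store one.

    cache          -- dict-like store standing in for the Mongo collection
    is_valid       -- callable deciding whether a cached cluster can be reused
    fetch_features -- callable returning the reference image features
    query_similar  -- callable returning products similar to those features
    """
    cached = cache.get(key)
    if cached is not None and is_valid(cached):
        return cached
    features = fetch_features()        # from Solr or from the given picture
    cluster = query_similar(features)  # similarity request against Solr
    cache[key] = cluster               # store for later calls
    return cluster
```

A second call with the same key returns the stored cluster without querying again, as long as ''is_valid'' accepts it.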


== Calculating the price recommendation ("calc_price_reco") ==

Returns a recommendation based on price information within the cluster, namely how far the reference price lies below the mean price of the cluster. The recommended price is calculated as follows:

<code>
  mean_price = params["mean_price"]
  ref_price = int(params["ref_price"])
</code>
<code>
  def __calc_prices(f1, f2):
      p1 = ref_price * f1
      p2 = ref_price * f2
      p1, p2 = _adjust_prices(p1, p2)
      m1 = (p1 / ref_price - 1.0) * 100
      m2 = (p2 / ref_price - 1.0) * 100
      result["price_range"] = [p1, p2]
      result["margin_range"] = [m1, m2]
</code>
<code>
  current_diff = 1.0 - ref_price / mean_price
  if current_diff < 0.2:
      result["uncertain"] = True
  elif 0.2 < current_diff < 0.4:
      __calc_prices(1.05, 1.2)
  elif 0.4 < current_diff < 0.7:
      __calc_prices(1.15, 1.45)
  elif 0.7 < current_diff < 0.9:
      __calc_prices(1.35, 1.75)
  elif current_diff > 0.9:
      p1 = ref_price * 2
      p1, p2 = _adjust_prices(p1, p1)
      m1 = (p1 / ref_price - 1.0) * 100
      m2 = (p2 / ref_price - 1.0) * 100
      result["price_range"] = [p1, p2]
      result["margin_range"] = [m1, m2]
      result["huge_markup"] = True

  return result
</code>
Note: The function “_adjust_prices” is essentially a lookup in the OSA Mongo [“osa_db”][“realistic_prices”] that returns manually defined “realistic prices”. The constants above are based on empirical experience. For questions, contact Björn or Sebastian.
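To try the thresholds with concrete numbers, the snippets above can be condensed into one self-contained function. This is a sketch, not the production code: the function name is made up, and ''_adjust_prices'' is replaced by a stub that returns the prices unchanged, because the real lookup needs the Mongo. Note that, with the comparisons as written above, a value of ''current_diff'' exactly equal to one of the thresholds (0.2, 0.4, 0.7, 0.9) matches no branch.

```python
def price_reco_sketch(ref_price, mean_price, adjust=lambda p1, p2: (p1, p2)):
    """Condensed version of the branches above.

    `adjust` stands in for _adjust_prices; the default stub returns
    both prices unchanged.
    """
    result = {}

    def set_range(p1, p2):
        p1, p2 = adjust(p1, p2)
        result["price_range"] = [p1, p2]
        # margin in percent relative to the reference price
        result["margin_range"] = [(p / ref_price - 1.0) * 100 for p in (p1, p2)]

    current_diff = 1.0 - ref_price / mean_price
    if current_diff < 0.2:
        result["uncertain"] = True
    elif 0.2 < current_diff < 0.4:
        set_range(ref_price * 1.05, ref_price * 1.2)
    elif 0.4 < current_diff < 0.7:
        set_range(ref_price * 1.15, ref_price * 1.45)
    elif 0.7 < current_diff < 0.9:
        set_range(ref_price * 1.35, ref_price * 1.75)
    elif current_diff > 0.9:
        set_range(ref_price * 2, ref_price * 2)
        result["huge_markup"] = True
    return result


# Reference price at half of the cluster mean: falls into the 0.4..0.7 branch.
example = price_reco_sketch(ref_price=50, mean_price=100)
```

With a reference price of 50 and a mean price of 100, ''current_diff'' is 0.5, so the recommended range is roughly 1.15 to 1.45 times the reference price.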


== Calculating the trend score ==

The API checks in Mongo whether a recent and valid cluster is available. If not, it calculates and stores a new cluster. See build_cluster for a description of cluster_params.

Input: cluster WITHOUT the reference

Output: cluster trend as float

The API uses the product trends and similarities to calculate the cluster trend. It was used as a wrapper for calc_cluster_trend_with_similarity (Björn's comment).

Rough calculation steps:


  * get product scores and similarity scores

<code>
  similarity_maximum_distance = max(inp.similarity_scores)
  keep_values_ratio = 0.8
  
  product_scores = np.array(inp.product_scores, dtype=np.float32)
  similarity_scores = np.array(inp.similarity_scores, dtype=np.float32)
</code>

  * rescale similarity scores

<code>
  similarity_scores = 1 - (similarity_scores / similarity_maximum_distance)
</code>

  * keep only the 80% most similar products as relevant

<code>
  idx = np.argsort(similarity_scores)
  s, p = similarity_scores[idx], product_scores[idx]
  border1 = int(len(s) * (1 - keep_values_ratio) + 0.5)
  p = p[border1:]
</code>

  * average the remaining product scores to get cluster_trend

<code>
  cluster_trend = float(np.mean(p))
</code>

  * return cluster_trend float score
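The steps above can be followed with small numbers in a dependency-free sketch. It mirrors the NumPy snippets in plain Python (so it works with regular floats instead of float32), and it assumes that the incoming similarity scores are distances, as the name ''similarity_maximum_distance'' suggests. The function name is hypothetical.

```python
def cluster_trend_sketch(product_scores, distances, keep_values_ratio=0.8):
    """Plain-Python version of the steps above, for illustration only."""
    max_distance = max(distances)
    # rescale: the largest distance becomes similarity 0, distance 0 becomes 1
    similarities = [1 - d / max_distance for d in distances]
    # indices ordered from least to most similar (like np.argsort)
    order = sorted(range(len(similarities)), key=similarities.__getitem__)
    # drop the least similar products, keep the rest
    border = int(len(order) * (1 - keep_values_ratio) + 0.5)
    kept = [product_scores[i] for i in order[border:]]
    return sum(kept) / len(kept)


# Five products; the one with distance 5 is the least similar and is dropped.
trend = cluster_trend_sketch([0.1, 0.2, 0.3, 0.4, 0.5], [5, 4, 3, 2, 1])
```

In this example ''border'' is 1, so the product with score 0.1 is removed and the result is the mean of the remaining four scores (about 0.35).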


==== "/trend_cluster" [POST] ====

Calculates a cluster of products from trendsetter shops that are similar to a reference product.

Input and Output require the same fields as “/comp_cluster”.

The calculations are basically the same as for “/comp_cluster”; the only difference is that the clusters are calculated from the trendsetter shops instead of the competitors.


==== "/slim_comp_cluster" [GET] ====

Returns a slim version of comp_cluster that is used in overlap calculations. The results will not be stored in the Mongo DB. The calculation steps are the same as “/comp_cluster”.

Input:


  * shop_id
  * prod_id

Output:

<code>
  shop_id: str
  prod_id: str
  picalike_id: str
  picalike_cat: List[str] = Schema(
      ...,
      description="picalike category of the reference article"
  )
  brand: str = Schema(
      ...,
      description="brand of the reference article"
  )
  cluster: List[SlimClusterEntry] = Schema(
      [],
      description="contains only data for competitor products. (does not contain the reference product)"
  )
</code>
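A short sketch of how the request URL could be assembled. It assumes that ''shop_id'' and ''prod_id'' are passed as query parameters, which this page does not state explicitly; the helper name and the ids are placeholders.

```python
from urllib.parse import urlencode

# Host as given in the "How to access" section above.
HOST = "http://pci01.picalike.corpex-kunden.de:8009"


def slim_cluster_url(endpoint, shop_id, prod_id):
    """Build the GET URL for a slim cluster endpoint (query params assumed)."""
    query = urlencode({"shop_id": shop_id, "prod_id": prod_id})
    return f"{HOST}/{endpoint}?{query}"


# Placeholder ids, for illustration only.
url = slim_cluster_url("slim_comp_cluster", "example_shop", "12345")
```

The same helper would work for “/slim_trend_cluster” by changing the endpoint name.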

==== "/slim_trend_cluster" [GET] ====

Returns a slim version of trend_cluster that is used in overlap calculations. The results will not be stored in the Mongo DB. The calculation steps are the same as “/trend_cluster”.

Input and Output require the same fields as “/slim_comp_cluster”.


==== "/shop_ids" [GET] ====

Returns a list of shop_ids that have cluster_reports.

Reads the data directly from the live Mongo: db “reports”, collection “comp_reports”.


==== "/get_cluster_report" [POST] ====

Returns a ClusterReport (including the price reco and cluster trend). It does not calculate any data; it only looks up cached results in the Mongo DB (OSA DATA), in [“reports”][“price_reco”] and [“reports”][“trend_indicator”].

Input:

<code>
  report_id: str
</code>
Output:

<code>
  report_id: str
  price_reco: PriceReco
  price_reco_date: datetime
  cluster_trend: Optional[float]
  cluster_trend_history: List[TrendHistoryEntry]
  cluster_trend_date: datetime
  comp_cluster_stats: ClusterStats
  trend_cluster_stats: ClusterStats
</code>

==== "/calc_report" [POST] ====

Returns a ClusterReport for a report_id. It is like “/get_cluster_report”, but values are calculated if needed. The calculation steps are the same as for “/trend_cluster”.

This API is also called by solr updater in get_cluster_info.

Input:

<code>
  report_id: str
  reference: Optional[SimProduct]
  reference_features: List[float]
  reference_sortmax: List[int]
  force_calc: bool = Schema(
      False,
      description="NotImplemented"
  )
</code>
Output:

<code>
  report_id: str
  price_reco: PriceReco
  price_reco_date: datetime
  cluster_trend: Optional[float]
  cluster_trend_history: List[TrendHistoryEntry]
  cluster_trend_date: datetime
  comp_cluster_stats: ClusterStats
  trend_cluster_stats: ClusterStats
  
</code>

==== "/calc_single_solr" [POST] ====

Returns a Solr Report containing clusters, trend score and price recommendation to be added to a Solr document.

This API is called by the solr updater in get_cluster_data for products that have passed the category and gender checks and do not have a fresh cluster in the solr_reports collection. The API calculates the competitor cluster and the trend cluster (by calling similarity_api), the cluster trend score and the price recommendation. Each of these is saved in its respective collection in Mongo. The information needed for the Solr document is summarized in another report, which is saved in the solr_reports collection (it contains at most one document per product) and returned to the solr updater.

Input:

<code>
  reference: dict
  session: int
  
</code>
reference must contain the following fields:

<code>
  picalike_id: only reference
  id: product id
  shop_id: shop id 
  img: main image url
  location: deeplink, opt
  price: price (int)
  name: opt
  brand: opt
  picalike_cat: list
  picalike_gender: list
  w: 0, opt
  features: list
  sortmax: list
</code>
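Before calling the endpoint, a caller could check that the reference dict is complete. The sketch below treats every field that is not marked “opt” in the list above as required; that reading, as well as the helper name, is an assumption and not taken from the API code.

```python
# Fields not marked "opt" in the list above (assumed to be required).
REQUIRED_FIELDS = (
    "picalike_id", "id", "shop_id", "img", "price",
    "picalike_cat", "picalike_gender", "features", "sortmax",
)


def missing_reference_fields(reference):
    """Return the names of required fields that are absent from `reference`."""
    return [name for name in REQUIRED_FIELDS if name not in reference]


# Incomplete example dict with placeholder values.
missing = missing_reference_fields({"id": "12345", "shop_id": "example_shop", "price": 1999})
```

An empty list means all required fields are present; the check does not validate the types of the values.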
Output:

<code>
  solr_report (dict)
  
</code>
solr_report contains:

<code>
  price_diff_int
  margin_float
  price_reco_date_dt
  cluster_trend_date_dt
  cluster_trend_float
  comp_cluster_size_int
  trend_cluster_size_int
  comp_cluster_keywords_multitext
  trend_cluster_keywords_multitext
  comp_cluster_attributes_multitext
  trend_cluster_attributes_multitext
  comp_cluster_unique_competitors_multitext
  trend_cluster_unique_competitors_multitext
</code>