Guidelines for Input Data
If you were to classify two products as a match, what data would you need?
Would this product description be enough?
CREST PRO HEALTH MINT 5 OZ
For some products, it might be. But Crest Pro-Health comes in two flavors, "clean mint" and "mint burst," so this description is ambiguous.
Ambiguity and certainty go hand in hand. All matches are classified with "high" or "medium" certainty. If the data is ambiguous, meaning it could refer to two or more products, then the certainty score will usually reflect that.
Data quality
Generally speaking, if a person can differentiate between similar products, your data is of sufficient quality. Additionally, abbreviations, odd characters, and formatting issues are generally not a problem, as long as a human being can understand what the data says.
Image vs text matching
Ideally, you would use both image and text data from both product sources. Both kinds of data are important for matching, but image matching is prohibitive for high-volume use cases, and images are often not available for products sourced from wholesale suppliers.
Without an image, text-only language models with extensive product knowledge can still provide accurate matches at high volume, especially when supplied with quality data. Product codes also go a long way toward improving match certainty.
Data Guidelines
Clean, descriptive data leads to good AI output.
A human should be able to differentiate between similar products based on the data.
Pay attention to the "certainty" score. Ambiguous data leads to a higher false positive rate, but we offset that risk by showing the certainty of a match as "high" or "medium."
Codes: Product codes are not required to search, but we strongly recommend providing every code type you have access to, because codes clarify exactly which product model or variant is being described. Helpful codes include UPC, GTIN, EAN, and manufacturer product codes.
Use key-value pairs whenever you add a column to a product: we cannot see which column a value came from unless you provide it as a key-value pair. Any unfamiliar column names go into the description as key:value pairs, and they are helpful. For example, if your columns were title, manufacturer, and additional codes, you could send a product object as follows: {"title": "this is a title", "manufacturer": "P&G", "additional codes": 53496}.
Branded vs. generic products: Our product matching models are designed to match branded products. Generic products from different manufacturers will likely be classified as not a match.
Bad columns: Columns with supplier-specific data, such as quantity and lead time, are not needed and can sometimes be misleading. For example, "quantity: 10" could appear to be a pack quantity instead of a stock quantity. As a rule of thumb, keep any column that describes what the product is or what configuration it is in, and drop any data that would never appear on an ecommerce listing, such as lead time.
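As a rough illustration of the key-value and bad-column guidelines above, the Python sketch below turns one spreadsheet row into a key-value product object and drops supplier-specific columns. The function name row_to_product, the SUPPLIER_ONLY_COLUMNS set, and the "lead time" column are assumptions made for this example. They are not part of any API, so adjust them to your own data.

```python
# Hypothetical list of supplier-specific columns; adjust to your own feed.
SUPPLIER_ONLY_COLUMNS = {"quantity", "lead time"}


def row_to_product(row: dict) -> dict:
    """Keep descriptive columns as key-value pairs and drop supplier-only ones."""
    product = {}
    for column, value in row.items():
        if column.strip().lower() in SUPPLIER_ONLY_COLUMNS:
            continue  # stock quantity or lead time could mislead the matcher
        if value is None or str(value).strip() == "":
            continue  # skip empty cells rather than sending blank values
        product[column] = value  # the column name travels with the value
    return product


# Example row, reusing the column names from the guideline above.
row = {
    "title": "CREST PRO HEALTH MINT 5 OZ",
    "manufacturer": "P&G",
    "additional codes": 53496,
    "quantity": 10,
    "lead time": "3 days",
}

product = row_to_product(row)
print(product)
```

Keeping the original column name as the key lets a value such as 53496 be read as a code and not as an unlabeled number.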