How Geniebot matches product to policy using Natural Language Processing

Lihan Li

Engineering Manager, Data & Analytics

Intro to Geniebot

Geniebot finds the appropriate insurance policy for your e-commerce store by analyzing product metadata. It uses an advanced Natural Language Processing (NLP) model to match thousands of product titles to the appropriate product categories, and it enriches other transactional data to return XCover Protection in real time.

There are several applications for Geniebot’s real-time enrichment:

  1. For our partners in retail and those who host checkouts, given a product title “Samsung Galaxy A13 16GB MEM 256GB”, Geniebot produces the classification of “Electronics –> Phones –> Mobile Phones”.
  2. For fintech and banking apps that monitor transactions, Geniebot can additionally receive merchant category codes (MCCs) and categorize them into insurable transactions. For instance, 3000 (United), 3012 (Qantas) and 4411 (cruises) return XCover travel and cruise protection; 3357 (Hertz) and 3389 (Avis) return RentalCover protection; 6513 (real estate agents) returns property insurance (homeowners, renters, title and others, depending on associated charges such as various contractor-related codes); 0742 (vets) returns pet insurance; and 5533 (auto parts) returns auto warranties. There are many more.
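
As a rough illustration of the MCC example above, the mapping can be pictured as a lookup table. This is only a sketch: the table holds just the codes quoted in this post, and the label strings and the `protection_for_mcc` helper are hypothetical names, not Geniebot's actual API.

```python
from typing import Optional, Union

# Hypothetical table containing only the MCC examples quoted in this post.
# Codes are stored as strings so that a leading zero (e.g. "0742") survives.
MCC_TO_PROTECTION = {
    "3000": "travel_and_cruise",  # United
    "3012": "travel_and_cruise",  # Qantas
    "4411": "travel_and_cruise",  # cruises
    "3357": "rental_cover",       # Hertz
    "3389": "rental_cover",       # Avis
    "6513": "property",           # real estate agents
    "0742": "pet",                # vets
    "5533": "auto_warranty",      # auto parts
}


def protection_for_mcc(mcc: Union[str, int]) -> Optional[str]:
    """Return the protection label for an MCC, or None if it is not in the table."""
    # Normalize to a 4-character string so that 742 and "0742" are treated alike.
    key = str(mcc).strip().zfill(4)
    return MCC_TO_PROTECTION.get(key)


print(protection_for_mcc("3012"))
print(protection_for_mcc(742))
print(protection_for_mcc("9999"))
```

In practice the other signals mentioned in this post (merchant name, item descriptors, itinerary data) would refine the result; a plain lookup only covers the first step.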

 

Additionally, LatLong, MerchantName and Website, plus Item Descriptor, Quantity, Unit Cost, Product Codes, Taxes and Shipping Costs, all help Geniebot confirm the insurability of a merchant-level transaction, as do itinerary and ancillary data where travel merchants provide them. Geniebot can then return a prefilled quote in real time to any of our partners’ customers worldwide.

Why product categorization is challenging

Firstly, human languages are characterized by a great deal of ambiguity and imprecision. The same word, phrase, or sentence can have different meanings depending on context and order. In addition, product titles use blended words not found in dictionaries. Consumers are naturally able to understand these ambiguities, but it’s harder for machines to parse them.

Secondly, there is a large number of product categories to select from (Google Shopping lists over 5,000). Conventional binary or multi-class classifiers perform poorly here because they do not scale to anywhere near the required number of categories.

Thirdly, it’s slow and expensive to train and iterate on models because of the high number of product categories and the size of the training text, which often exceeds 10 GB.

What is Natural Language Processing

Natural Language Processing, or NLP, is a subfield of Artificial Intelligence concerned with interpreting and understanding human languages.

A typical NLP pipeline includes data pre-processing (tokenization, lemmatization, stemming, etc.) and a high-dimensional numerical representation (an embedding) for each word token. By design, words with similar meanings should have a small distance between their representations.
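
To make the idea of "small distance" concrete, here is a minimal sketch of tokenization and cosine distance. The three-dimensional vectors in `TOY_EMBEDDINGS` are made up for illustration; real embeddings are learned and have hundreds of dimensions, and the regex tokenizer is a simplification rather than what Geniebot uses.

```python
import math
import re


def tokenize(title):
    """Lowercase a product title and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", title.lower())


# Made-up 3-dimensional vectors, for illustration only.
TOY_EMBEDDINGS = {
    "phone": [0.9, 0.1, 0.0],
    "mobile": [0.8, 0.2, 0.1],
    "leash": [0.0, 0.1, 0.9],
}


def cosine_distance(a, b):
    """Return 1 - cosine similarity; smaller means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


tokens = tokenize("Samsung Galaxy A13 16GB MEM 256GB")
near = cosine_distance(TOY_EMBEDDINGS["phone"], TOY_EMBEDDINGS["mobile"])
far = cosine_distance(TOY_EMBEDDINGS["phone"], TOY_EMBEDDINGS["leash"])
print(tokens)
print(near < far)  # "phone" sits closer to "mobile" than to "leash"
```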

Modern NLP approaches

Geniebot uses a state-of-the-art transformer-based neural network architecture that consists of a bi-directional encoder and an autoregressive decoder, with nearly 150 million trainable parameters combined. The encoder learns a representation of the product title as a whole (sentence-level representation) and also embeds the contextual meaning of each word. Both are combined to create the final embedding vector for the product title.

The second part of the model maps the embedding output from the encoder to target product categories. The decoder was trained to produce the category text. We first trained it using next-sentence or token prediction methods but did not achieve satisfactory results, so we switched to an autoregressive, GPT-like architecture for our decoder.

For the downstream classification task, Geniebot uses a single binary classifier instead of many one-vs-rest classifiers (thousands, in this case) or one multi-class classifier (which is not feasible given the number of classes). We used negative sampling to transform our dataset from category prediction to category matching: each (product title, category) pair is labeled “1” if the category is correct and “0” otherwise.
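
The transformation from category prediction to category matching can be sketched as follows. This is an assumption-laden illustration, not our training code: the function name `build_matching_pairs`, the `negatives_per_positive` parameter, uniform random sampling of negatives and the tiny category list are all hypothetical.

```python
import random


def build_matching_pairs(examples, all_categories, negatives_per_positive=2, seed=0):
    """Turn (title, true_category) examples into (title, category, label) pairs.

    Each example yields one positive pair (label 1) and a few randomly
    sampled wrong categories as negative pairs (label 0).
    """
    rng = random.Random(seed)
    pairs = []
    for title, true_category in examples:
        pairs.append((title, true_category, 1))
        candidates = [c for c in all_categories if c != true_category]
        k = min(negatives_per_positive, len(candidates))
        for wrong_category in rng.sample(candidates, k):
            pairs.append((title, wrong_category, 0))
    return pairs


# Toy data for illustration only.
CATEGORIES = [
    "Electronics > Phones > Mobile Phones",
    "Electronics > Cameras",
    "Pet Supplies > Dog Supplies",
    "Pet Supplies > Cat Supplies",
]
EXAMPLES = [("Samsung Galaxy A13 16GB MEM 256GB", CATEGORIES[0])]

pairs = build_matching_pairs(EXAMPLES, CATEGORIES)
for pair in pairs:
    print(pair)
```

A single binary classifier trained on such pairs answers "does this title match this category?", so adding a new category does not require adding a new output class.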

Category Tree Pruning

Product categories typically form a hierarchy. If a product is in the “Electronics” category, it cannot be put in any other category that is of the same level (e.g., “Pet Supplies”).

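At inference time, we use this rule to prune the product categories to which the product cannot belong. Because whole subtrees are discarded at each level, the number of categories the model has to score grows roughly logarithmically, rather than linearly, with the total number of categories, which speeds up inference considerably.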
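
One plausible reading of this pruning rule is a greedy top-down walk of the category tree: score only the children of the category chosen at the previous level and discard every other subtree. The sketch below assumes that reading. The tree, the keyword table and `toy_score` are hypothetical stand-ins; in Geniebot the score would come from the binary matching model.

```python
# Hypothetical miniature category tree: each key maps to its sub-categories.
CATEGORY_TREE = {
    "Electronics": {
        "Phones": {"Mobile Phones": {}, "Landline Phones": {}},
        "Cameras": {},
    },
    "Pet Supplies": {"Dog Supplies": {}, "Cat Supplies": {}},
}

# Stand-in for the model: keyword overlap between title and category.
TOY_KEYWORDS = {
    "Electronics": {"samsung", "galaxy", "camera"},
    "Phones": {"galaxy", "phone"},
    "Mobile Phones": {"galaxy", "16gb"},
    "Landline Phones": {"cordless"},
    "Cameras": {"camera"},
    "Pet Supplies": {"dog", "cat"},
    "Dog Supplies": {"dog"},
    "Cat Supplies": {"cat"},
}


def toy_score(title, path):
    """Score how well a title matches the last category on the path."""
    words = set(title.lower().split())
    return len(words & TOY_KEYWORDS.get(path[-1], set()))


def classify_with_pruning(title, tree, score):
    """Walk the tree top-down, scoring only the children of the chosen node."""
    path, scored, node = [], 0, tree
    while node:  # stop when the chosen category has no sub-categories
        scored += len(node)
        best = max(node, key=lambda name: score(title, path + [name]))
        path.append(best)
        node = node[best]  # every sibling subtree is pruned here
    return path, scored


path, scored = classify_with_pruning(
    "Samsung Galaxy A13 16GB MEM 256GB", CATEGORY_TREE, toy_score
)
print(path, scored)
```

In this toy tree the walk scores 6 of the 8 categories; the saving grows with the tree, because each level only adds the children of a single node.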

Conclusion

With these techniques combined (encoder-decoder architecture, binary transformation, category tree pruning), Geniebot can produce the product category from the product title in under 60 milliseconds at the median (P50) using a single deep learning model.