A Scratchbook

Scratchbooks are Jupyter notebooks that I create to work through an idea or a new methodology. They are usually a little half-baked, and they usually serve as a rough sketch for a larger project.

Searching for a Word in Business Names

In this notebook we will look at how to identify a certain type of business just from its name. For example, say we want to identify businesses that are likely grocery stores. Simply checking whether the business name contains “grocery” would be too narrow. We want to find other words in the business name data that carry a similar meaning and that we could also search for. Having these different words will make our search more robust.

import csv
import re
import string
from collections import OrderedDict

import annoy
import spacy
from scipy.spatial.distance import cosine
from spacy.lemmatizer import Lemmatizer

First we will read in the data and put it in a dictionary.
The data came from the following Kaggle dataset:
https://www.kaggle.com/peopledatalabssf/free-7-million-company-dataset
It contains roughly 7 million businesses from around the world.

I should be using something like pandas, but sometimes it’s fun to just use Python’s built-in csv package.
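For reference, a pandas version of the next cell would be a couple of lines; this is just a sketch, assuming the same companies_sorted.csv file (the bus_df and bus_names variables are hypothetical and not used below):

import pandas as pd

# Hypothetical pandas equivalent of the csv-reading loop below
bus_df = pd.read_csv("companies_sorted.csv")
bus_names = bus_df["name"].tolist()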

with open("companies_sorted.csv", "r") as bnm_csv:
    csv_reader = csv.DictReader(bnm_csv)
    bus_dat = OrderedDict()
    for idx, r in enumerate(csv_reader):
        if idx == 0:
            # The first data row initializes one list per column
            for col in r:
                bus_dat[col] = [r[col]]
        else:
            for col in r:
                bus_dat[col].append(r[col])
list(bus_dat)
['',
 'name',
 'domain',
 'year founded',
 'industry',
 'size range',
 'locality',
 'country',
 'linkedin url',
 'current employee estimate',
 'total employee estimate']
for i, j in zip(bus_dat["name"][0:5], bus_dat["domain"][0:5]):
    print(f"{i}, {j}")
ibm, ibm.com
tata consultancy services, tcs.com
accenture, accenture.com
us army, goarmy.com
ey, ey.com

So we only need the name column, which has all of the business names. This is an interesting dataset; it would be fun to explore it further in another notebook in the future.

bus_name = bus_dat["name"]
len(bus_name)
7173426

Next, we are going to go through some basic text preprocessing:

  • Split all of the names into individual tokens.
  • Remove numbers and punctuation.
  • Convert all words to lowercase.

First, we will go through the list and split every name on spaces.

bus_parts = [
    word for name_list in [name.split() for name in bus_name] for word in name_list
]

punc_n_nums = string.punctuation + string.digits
# Strip punctuation and numbers
bus_parts = [s.translate(str.maketrans("", "", punc_n_nums)) for s in bus_parts]
bus_parts[0:10]
['ibm',
 'tata',
 'consultancy',
 'services',
 'accenture',
 'us',
 'army',
 'ey',
 'hewlettpackard',
 'cognizant']
len(bus_parts)
21455275

Now we are going to dedup the words; we will do this several times throughout the process.

bus_parts = list(set([s.lower() for s in bus_parts if s != ""]))

Process Names with spaCy

Next we will process the text using spaCy. There are a few things we want to do:

  • Remove stop words.
  • Lemmatize words.

This is a very cool package for natural language processing; check out the docs for more details (https://spacy.io/).

After this we will dedup the words, and then get the word vectors.

nlp = spacy.load("en_core_web_md")
proc_words = list(set([s for s in bus_parts if s not in nlp.Defaults.stop_words]))

Next we will lemmatize all of the word parts.

lemmatizer = Lemmatizer(nlp.vocab.lookups)

This will lemmatize a word if it is in the lookup table, and otherwise just return the word unchanged. Lemmatization in natural language processing is the act of normalizing words back to their root form. The following shows some examples of this.

lemmatizer.lookup("going"), lemmatizer.lookup("caring"), lemmatizer.lookup(
    "someCrazyWord"
)
('go', 'care', 'someCrazyWord')

Now we will apply this to all of the words we are working with.

proc_lemma = [lemmatizer.lookup(w) for w in proc_words]
# Drop duplicates
lemma_sub = list(set(proc_lemma))
len(lemma_sub)
2158126

Next, we will go through and add all of our words to a dictionary where the lemma is the key and the value is a tuple holding an index and the word vector.

bwd = {}
for idx, w in enumerate(lemma_sub):
    # Only keep words that have a pretrained vector
    if nlp.vocab[w].has_vector:
        bwd[w] = (idx, nlp.vocab[w].vector)

Word vectors provide a way for us to represent words in vector space. One of the most common models for creating them is Word2Vec. The closer a word is to another word in vector space, the more similar their meanings are.

def word_cosine_similarity(w1, w2, model):
    # scipy's cosine() returns a distance, so 1 - distance is the similarity
    return 1 - cosine(model.vocab[w1].vector, model.vocab[w2].vector)
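For reference, the quantity computed above is the cosine of the angle between the two vectors. A minimal numpy check, using made-up toy vectors purely for illustration:

import numpy as np

# Toy vectors, not real word vectors
u = np.array([1.0, 2.0, 3.0])
v = np.array([2.0, 4.0, 6.0])

# cos(theta) = (u . v) / (||u|| * ||v||)
np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
# equals 1.0, since v is a scalar multiple of u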

Other text similarity measurements, such as edit distance or Soundex, look at whether two words are spelled similarly. Word vectors instead capture semantic similarity.
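To make the contrast concrete, here is a quick Levenshtein edit distance sketch (a standard dynamic-programming implementation, added here for illustration; it is not part of the original analysis):

def edit_distance(a, b):
    # Classic Levenshtein distance via dynamic programming
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(
                min(
                    prev[j] + 1,  # deletion
                    curr[j - 1] + 1,  # insertion
                    prev[j - 1] + (ca != cb),  # substitution
                )
            )
        prev = curr
    return prev[-1]

# "framer" is only two edits from "farmer" despite the unrelated meaning,
# while "agriculture" is nine edits away
edit_distance("farmer", "framer"), edit_distance("farmer", "agriculture")
(2, 9)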

word_cosine_similarity("farmer", "framer", nlp)
0.06580065190792084
word_cosine_similarity("farmer", "agriculture", nlp)
0.5305896997451782

Here we see the words “farmer” and “framer”, though spelled with the same letters, are not very similar from a meaning perspective. However, the words “farmer” and “agriculture” are much more similar.

Query Word Vectors with Annoy

Now we will query our word vectors using the package Annoy: https://github.com/spotify/annoy
This is a fabulous approximate nearest neighbors package that I use a lot for querying word vectors.

Our goal here is to identify words that are similar to our main word of interest, “grocery” in this case. Finding all of these variations will give us a list of search terms we can use to identify other businesses whose names have meanings similar to “grocery store”.

aidx = annoy.AnnoyIndex(300, "angular")  # 300-dim vectors, angular (cosine) distance
for i in bwd.values():
    aidx.add_item(*i)  # each value is an (index, vector) pair
aidx.build(n_trees=300)
True

Now we can query the index and find the most similar words to our word of interest, “grocery”.

groc_words = [lemma_sub[i] for i in aidx.get_nns_by_item(bwd["grocery"][0], 20)]
groc_words
['grocery',
 'grocer',
 'minimart',
 'newsagency',
 'newsagent',
 'healthfood',
 'supermarket',
 'waitrose',
 'hypermarket',
 'supercenters',
 'store',
 'shoppings',
 'shopper',
 'shoping',
 'knish',
 'enoteca',
 'presliced',
 'delicatessen',
 'carryout',
 'bodega']

So not all of these words would exactly identify a grocery store, but the list still provides more alternatives than searching for “grocery” alone.

Final Search Terms

Finally, let’s grab any variations of these words that may exist in our pre-lemmatized data. We will add these variations to our search terms.

search_terms = {}
for w in groc_words:
    search_terms[w] = []
lemma_tuples = [(i, j) for i, j in zip(proc_lemma, proc_words) if i in groc_words]
lemma_tuples[0:4]
[('minimart', 'minimart'),
 ('newsagent', 'newsagents'),
 ('knish', 'knish'),
 ('store', 'stored')]

We will append all of these variations into our search term dictionary.

for i, j in lemma_tuples:
    search_terms[i].append(j)
search_terms
{'grocery': ['grocery', 'groceries'],
 'grocer': ['grocer', 'grocers'],
 'minimart': ['minimart'],
 'newsagency': ['newsagency'],
 'newsagent': ['newsagents', 'newsagent'],
 'healthfood': ['healthfood'],
 'supermarket': ['supermarket', 'supermarkets'],
 'waitrose': ['waitrose'],
 'hypermarket': ['hypermarkets', 'hypermarket'],
 'supercenters': ['supercenters'],
 'store': ['stored', 'storing', 'store', 'stores'],
 'shoppings': ['shoppings'],
 'shopper': ['shopper', 'shoppers'],
 'shoping': ['shoping'],
 'knish': ['knish'],
 'enoteca': ['enoteca'],
 'presliced': ['presliced'],
 'delicatessen': ['delicatessen', 'delicatessens'],
 'carryout': ['carryout'],
 'bodega': ['bodega', 'bodegas']}

I am actually going to drop the “store”, “newsagent”, and “newsagency” terms.

del search_terms["newsagent"]
del search_terms["newsagency"]
del search_terms["store"]

If I were doing this for a formal project, I would probably think through the best way to use these word variations to query the business names. I would also try to identify common misspellings of these words.
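As a sketch of that misspelling idea, one common approach (in the style of Peter Norvig’s spelling corrector) is to generate every string within one edit of a term and keep only the candidates that actually occur in the data. The edits1 helper below is hypothetical and is not run in this notebook:

def edits1(word):
    # All strings one edit away from word: deletes, transposes, replaces, inserts
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    inserts = [l + c + r for l, r in splits for c in letters]
    return set(deletes + transposes + replaces + inserts) - {word}

# Candidate misspellings of "grocery" that actually appear in the name tokens
sorted(edits1("grocery") & set(proc_words))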

grocery_name_parts = [p for var in search_terms.values() for p in var]

We want to make sure we only match a term when it appears as a whole word, and not as part of a longer word.

# Match each term only when it stands alone: as the whole name, at the
# start or end of the name, or in the middle surrounded by spaces
grocery_pattern = r"|".join([f"^{i}$| {i}$| {i} |^{i} " for i in grocery_name_parts])
grocery_stores = [
    name for name in bus_dat["name"] if re.search(grocery_pattern, name)
]
len(grocery_stores)
2571
grocery_stores[0:15]
['shoppers drug mart',
 'wm morrison supermarkets plc',
 'waitrose',
 'c&s wholesale grocers',
 'hannaford supermarkets',
 'woolworths supermarkets',
 'price chopper supermarkets',
 'ralphs grocery company',
 'shoppers stop',
 'shoprite supermarkets',
 "shaw's supermarkets",
 'save mart supermarkets',
 'southeastern grocers',
 'brookshire grocery company',
 'associated wholesale grocers']

As we can see, we could use a process like this to easily identify a business type, a grocery store in our case, using only information from the business name.