Cosine Similarity

Cosine similarity measures the similarity between two vectors of an inner product space as the cosine of the angle between them. The CosineSimilarityVectorizer computes the pairwise cosine similarity between the rows of a vector SFrame, so it needs a vector as input. If you don't have a vector and instead provide a default transactions SFrame, the pivot vectorizer builds the vector automatically, and the cosine similarity is computed from it.

How it works

This vectorizer uses the cosine_similarity function from the scikit-learn library. The result is kept in a separate SFrame, which the examples below read from the cosine_sim key.

As mentioned above, you can either provide the vector yourself or let the vectorizer build it with the pivot vectorizer in its default configuration.

Example

Assume that we have the following vector:

from seshat.data_class import DFrame, GroupSFrame

sf = GroupSFrame(
    children={
        "vector": DFrame.from_raw(
            {
                "address": [
                    "address_1",
                    "address_2",
                    "address_3",
                ],
                "feature_1": [1, 12, 20],
                "feature_2": [100, 101, 40],
                "feature_3": [0.1, -2.2, -0.3],
            }
        )
    }
)

Configure the cosine similarity vectorizer and call it on the SFrame; the result then contains the similarity matrix.

from seshat.transformer.vectorizer import CosineSimilarityVectorizer

vectorizer = CosineSimilarityVectorizer()
sf = vectorizer(sf)
sf["cosine_sim"].data
>>>
           address_1  address_2  address_3
address_1   1.000000   0.993891   0.898827
address_2   0.993891   1.000000   0.940847
address_3   0.898827   0.940847   1.000000
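
To see where these numbers come from, the matrix can be reproduced without seshat. The following is a minimal, dependency-free sketch of the cosine formula (dot product divided by the product of the norms) applied to the three feature rows above. The names features, cosine, and matrix are illustrative and not part of seshat; the vectorizer itself relies on scikit-learn's cosine_similarity.

```python
import math

# Feature rows from the example above, keyed by address.
features = {
    "address_1": [1, 100, 0.1],
    "address_2": [12, 101, -2.2],
    "address_3": [20, 40, -0.3],
}


def cosine(u, v):
    """Cosine of the angle between u and v: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Square (n x n) similarity matrix as a nested dict (illustrative layout).
matrix = {
    row: {col: cosine(features[row], features[col]) for col in features}
    for row in features
}

for row, values in matrix.items():
    print(row, " ".join(f"{value:.6f}" for value in values.values()))
```

Because only the angle between two rows matters, address_1 and address_2 come out as nearly identical even though their feature_1 values differ by a factor of 12: feature_2 dominates both rows.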

Square Shape

By default, the result is an n × n matrix, where n is the number of addresses. In some cases, you may prefer a table with three columns (two address columns and one cosine column) that has one row per pair of addresses. This format is useful, for example, when you want to save the result in a database.

Example

Assume that you have obtained the similarity matrix shown above. If you set square_shape to False, the result is:

vectorizer = CosineSimilarityVectorizer(square_shape=False)
sf = vectorizer(sf)
sf["cosine_sim"].data
>>>
   address_2  address_1    cosine
0  address_2  address_1  0.993891
1  address_3  address_1  0.898827
2  address_3  address_2  0.940847
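
The relationship between the two shapes can be sketched in plain Python. This illustration assumes that each unordered pair appears once and that self-pairs are dropped, which is what the output above shows. The names square and to_long are hypothetical, and this is not seshat's implementation.

```python
# Square matrix from the first example (values rounded as printed there).
names = ["address_1", "address_2", "address_3"]
square = [
    [1.000000, 0.993891, 0.898827],
    [0.993891, 1.000000, 0.940847],
    [0.898827, 0.940847, 1.000000],
]


def to_long(names, square):
    """Keep the entries below the diagonal as (address_2, address_1, cosine) rows."""
    rows = []
    for i, later in enumerate(names):
        for j in range(i):  # j < i: skips the diagonal and the mirrored pairs
            rows.append((later, names[j], square[i][j]))
    return rows


long_rows = to_long(names, square)
for row in long_rows:
    print(row)
```

For n addresses this yields n * (n - 1) / 2 rows instead of n * n cells, which is why the long shape is the more compact one to store.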

You can set the result column names with address_1_col and address_2_col for the addresses and cosine_col for the cosine value.

vectorizer = CosineSimilarityVectorizer(
    square_shape=False,
    address_1_col="first_address",
    address_2_col="second_address",
    cosine_col="value_of_cosine",
)
sf = vectorizer(sf)
sf["cosine_sim"].data
>>>
  second_address first_address  value_of_cosine
0      address_2     address_1         0.993891
1      address_3     address_1         0.898827
2      address_3     address_2         0.940847

Threshold

The cosine similarity result can be very large. When square_shape is set to False, you can add a threshold to keep only the useful rows. For example, for every address, you can keep only its 50 most similar addresses. The threshold argument defaults to "none", meaning that no threshold is applied. The other options are described below, and threshold_value sets the value that the chosen threshold uses.

Threshold Types

The threshold can be of two types:

  • by count: With threshold set to by_count, only the top addresses are kept for each address, and the number kept equals threshold_value. Top addresses are those with the highest cosine values.

    vectorizer = CosineSimilarityVectorizer(
        square_shape=False, threshold_value=1, threshold="by_count"
    )
    sf = vectorizer(sf)
    sf["cosine_sim"].data
    >>>
       address_2  address_1    cosine
    0  address_2  address_1  0.993891
    1  address_3  address_2  0.940847
  • by value: With threshold set to by_value, only the rows whose cosine value is greater than threshold_value are kept.

    vectorizer = CosineSimilarityVectorizer(
        square_shape=False, threshold_value=0.9, threshold="by_value"
    )
    sf = vectorizer(sf)
    sf["cosine_sim"].data
    >>>
       address_2  address_1    cosine
    0  address_2  address_1  0.993891
    2  address_3  address_2  0.940847
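
Both threshold types can be sketched on the three-column rows. The by_value filter follows directly from the description. For by_count, the sketch assumes one plausible reading: every address ranks all of its partner rows, keeps its top threshold_value rows, and a row survives if either of its addresses kept it. This reproduces the output above, but seshat may implement it differently. The functions by_value and by_count here are hypothetical helpers, not seshat functions.

```python
from collections import defaultdict

# Three-column rows from the square_shape=False example.
rows = [
    ("address_2", "address_1", 0.993891),
    ("address_3", "address_1", 0.898827),
    ("address_3", "address_2", 0.940847),
]


def by_value(rows, threshold_value):
    """Keep rows whose cosine is strictly greater than threshold_value."""
    return [row for row in rows if row[2] > threshold_value]


def by_count(rows, threshold_value):
    """Keep a row if it ranks in the top rows of either of its addresses (assumed reading)."""
    partners = defaultdict(list)  # address -> indexes of the rows it appears in
    for index, (left, right, _) in enumerate(rows):
        partners[left].append(index)
        partners[right].append(index)
    keep = set()
    for indexes in partners.values():
        ranked = sorted(indexes, key=lambda i: rows[i][2], reverse=True)
        keep.update(ranked[:threshold_value])
    return [row for i, row in enumerate(rows) if i in keep]


print(by_value(rows, 0.9))
print(by_count(rows, 1))
```

With this data both calls keep the same two rows, for different reasons: by_value drops the address_3/address_1 row because 0.898827 is below 0.9, while by_count drops it because it is not the best match of either address.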

Find Top Addresses

The similarity matrix shows how similar each address is to the others, based on the cosine values. If you want to store the addresses that are most similar to the rest in a separate SFrame, set find_top_address to True and use top_address_limit to cap the number of top addresses that are kept.

Example

vectorizer = CosineSimilarityVectorizer(find_top_address=True, top_address_limit=2)
sf = vectorizer(sf)
sf["top_address"].data
>>>
     address    cosine
0  address_2  2.934739
1  address_1  2.892718

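Note that in this case, the value in the cosine column is the sum of an address's cosine values over all similarities, including its similarity of 1.0 with itself. For address_2, this is 0.993891 + 1.000000 + 0.940847 ≈ 2.93474.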
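
The ranking can be sketched as a row sum over the square matrix. Including the diagonal entry (1.0) in the sum is an assumption inferred from the totals printed above. The function top_addresses is a hypothetical helper, not a seshat function.

```python
# Square matrix from the first example (values rounded as printed there).
names = ["address_1", "address_2", "address_3"]
square = [
    [1.000000, 0.993891, 0.898827],
    [0.993891, 1.000000, 0.940847],
    [0.898827, 0.940847, 1.000000],
]


def top_addresses(names, square, limit):
    """Rank addresses by the sum of their row, self-similarity included."""
    totals = [(name, sum(row)) for name, row in zip(names, square)]
    totals.sort(key=lambda item: item[1], reverse=True)
    return totals[:limit]


top = top_addresses(names, square, limit=2)
for name, total in top:
    print(name, round(total, 6))
```

Since every row contains the same 1.0 on its diagonal, including it shifts all totals equally and does not change the order of the ranking.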

Exclusion

Sometimes you may want to exclude some addresses from the vector only for the cosine similarity computation. One such case is when the vector includes contract addresses that you don't want to consider for similarity. To do this, set the exclusion key in group_keys to the key of the SFrame that contains the addresses to exclude, and use the exclusion_token_col argument to name the column in that SFrame that holds those addresses.

Example

sf = GroupSFrame(
    children={
        "vector": DFrame.from_raw(
            {
                "address": [
                    "address_1",
                    "address_2",
                    "address_3",
                ],
                "feature_1": [1, 12, 20],
                "feature_2": [100, 101, 40],
                "feature_3": [0.1, -2.2, -0.3],
            }
        ),
        "exclusion": DFrame.from_raw({"address": ["address_1"]}),
    }
)

vectorizer = CosineSimilarityVectorizer(exclusion_token_col="address")
sf = vectorizer(sf)
sf["cosine_sim"].data
>>>
           address_2  address_3
address_2   1.000000   0.940847
address_3   0.940847   1.000000
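
The effect of the exclusion can be sketched as a filter applied before the similarity is computed. This is a dependency-free illustration under that assumption, not seshat's implementation; the names features, excluded, and cosine are hypothetical.

```python
import math

# Feature rows from the vector SFrame, keyed by address.
features = {
    "address_1": [1, 100, 0.1],
    "address_2": [12, 101, -2.2],
    "address_3": [20, 40, -0.3],
}
# Values of the "address" column in the exclusion SFrame.
excluded = {"address_1"}


def cosine(u, v):
    """Cosine of the angle between u and v: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Drop the excluded addresses first, then compare only the remaining rows.
kept = {addr: vec for addr, vec in features.items() if addr not in excluded}
matrix = {
    row: {col: cosine(kept[row], kept[col]) for col in kept} for row in kept
}

for row, values in matrix.items():
    print(row, " ".join(f"{value:.6f}" for value in values.values()))
```

The remaining value (0.940847) is the same as in the full matrix, because the cosine of a pair depends only on the two rows involved.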