Cosine Similarity
Cosine similarity measures the similarity between two vectors of an inner product space. With the
CosineSimilarityVectorizer, you can compute the vector of cosine similarities. This vectorizer needs a vector as
input. If you don't have a vector and provide a default transactions SFrame instead, the pivot vectorizer will
automatically compute the vector, and the cosine similarity will be based on it.
How it works
This vectorizer uses the cosine_similarity function from the scikit-learn library. The result is kept in a
separate SFrame; in the examples on this page it is read from the cosine_sim key.
As mentioned above, you can provide the vector yourself; otherwise the vectorizer uses the pivot vectorizer with its default configuration.
Example
Assume that we have the following vector:
from seshat.data_class import DFrame, GroupSFrame

sf = GroupSFrame(
    children={
        "vector": DFrame.from_raw(
            {
                "address": [
                    "address_1",
                    "address_2",
                    "address_3",
                ],
                "feature_1": [1, 12, 20],
                "feature_2": [100, 101, 40],
                "feature_3": [0.1, -2.2, -0.3],
            }
        )
    }
)
Configure the cosine similarity vectorizer and call it on the SFrame; the result then contains the similarity vector.
from seshat.transformer.vectorizer import CosineSimilarityVectorizer
vectorizer = CosineSimilarityVectorizer()
sf = vectorizer(sf)
sf["cosine_sim"].data
>>>
           address_1  address_2  address_3
address_1   1.000000   0.993891   0.898827
address_2   0.993891   1.000000   0.940847
address_3   0.898827   0.940847   1.000000
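The numbers in this matrix can be checked by hand. The following is a minimal sketch in plain Python, not the library's implementation (which, as stated above, relies on scikit-learn). It assumes that each row of the vector is treated as one feature vector and that the address column only serves as a label; the helper name cosine is hypothetical.

```python
import math

# Feature vectors copied from the example above (address -> features).
vectors = {
    "address_1": [1, 100, 0.1],
    "address_2": [12, 101, -2.2],
    "address_3": [20, 40, -0.3],
}


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (hypothetical helper)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Square n * n matrix stored as a nested dict: matrix[row][column].
matrix = {
    row: {col: cosine(vectors[row], vectors[col]) for col in vectors}
    for row in vectors
}

for row, values in matrix.items():
    print(row, [round(value, 6) for value in values.values()])
```

The matrix is symmetric and has 1 on the diagonal, because every vector points in exactly the same direction as itself.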
Square Shape
By default, the cosine similarity is an n * n matrix with one row and one column per address. However, in some cases, you may prefer a table with 3 columns: the two addresses and their cosine value. This format is useful when you want to save the result in a database, for example.
Example
Assume that you have obtained the similarity vector as shown above. If you set square_shape to False, the result
is:
vectorizer = CosineSimilarityVectorizer(square_shape=False)
sf = vectorizer(sf)
sf["cosine_sim"].data
>>>
   address_2  address_1    cosine
0  address_2  address_1  0.993891
1  address_3  address_1  0.898827
2  address_3  address_2  0.940847
You can set the result column names using address_1_col and address_2_col for the addresses, and cosine_col for the
cosine value.
vectorizer = CosineSimilarityVectorizer(
    square_shape=False,
    address_1_col="first_address",
    address_2_col="second_address",
    cosine_col="value_of_cosine",
)
sf = vectorizer(sf)
sf["cosine_sim"].data
>>>
  second_address first_address  value_of_cosine
0      address_2     address_1         0.993891
1      address_3     address_1         0.898827
2      address_3     address_2         0.940847
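To make the relationship between the two shapes concrete, here is a minimal sketch of turning the square matrix into the 3-column table. It is an illustration only: it assumes that each unordered pair is kept once and that the diagonal is dropped, which is what the outputs above suggest, and the function name to_long is hypothetical.

```python
# Square similarity matrix with the values from the example above.
addresses = ["address_1", "address_2", "address_3"]
square = [
    [1.000000, 0.993891, 0.898827],
    [0.993891, 1.000000, 0.940847],
    [0.898827, 0.940847, 1.000000],
]


def to_long(addresses, square):
    """Return one row per unordered pair of addresses, skipping the diagonal."""
    rows = []
    for i in range(len(addresses)):
        for j in range(i):  # j < i keeps only the lower triangle
            rows.append(
                {
                    "address_2": addresses[i],
                    "address_1": addresses[j],
                    "cosine": square[i][j],
                }
            )
    return rows


long_rows = to_long(addresses, square)
for row in long_rows:
    print(row)
```

For n addresses this yields n * (n - 1) / 2 rows instead of n * n cells, which is one reason the long format is convenient for storage.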
Threshold
The result of cosine similarity may be very large, since it grows with the square of the number of addresses. If you
set square_shape to False, you can add a threshold to keep only the useful results. For example, for every address,
you can keep the top 50 similar addresses. The threshold argument is set to "none" by default, meaning there is no
threshold. The other options are configured by threshold_value, which specifies the value for the threshold.
Threshold Types
The threshold can be of two types:
- by count: By setting threshold to by_count, for each address, only the top addresses are kept, and the count of these addresses is equal to threshold_value. Note that top addresses mean those with the highest cosine values.

vectorizer = CosineSimilarityVectorizer(
    square_shape=False, threshold_value=1, threshold="by_count"
)
sf = vectorizer(sf)
sf["cosine_sim"].data
>>>
   address_2  address_1    cosine
0  address_2  address_1  0.993891
1  address_3  address_2  0.940847

- by value: This threshold keeps only the rows where the cosine value is greater than threshold_value.

vectorizer = CosineSimilarityVectorizer(
    square_shape=False, threshold_value=0.9, threshold="by_value"
)
sf = vectorizer(sf)
sf["cosine_sim"].data
>>>
   address_2  address_1    cosine
0  address_2  address_1  0.993891
2  address_3  address_2  0.940847
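The two threshold types can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the library's code: the function names by_value and by_count are hypothetical, and grouping the by-count filter on the address_2 column is an assumption chosen because it is consistent with the output shown above.

```python
from collections import defaultdict

# Long-format rows from the square_shape=False example above.
rows = [
    {"address_2": "address_2", "address_1": "address_1", "cosine": 0.993891},
    {"address_2": "address_3", "address_1": "address_1", "cosine": 0.898827},
    {"address_2": "address_3", "address_1": "address_2", "cosine": 0.940847},
]


def by_value(rows, threshold_value):
    """Keep rows whose cosine is strictly greater than threshold_value."""
    return [row for row in rows if row["cosine"] > threshold_value]


def by_count(rows, threshold_value, group_col="address_2"):
    """Keep, for each address in group_col, the rows with the highest cosine."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_col]].append(row)
    kept = []
    for group in groups.values():
        ranked = sorted(group, key=lambda row: row["cosine"], reverse=True)
        kept.extend(ranked[:threshold_value])
    return kept


print(by_value(rows, 0.9))
print(by_count(rows, 1))
```

With the sample rows, both calls keep the address_2/address_1 and address_3/address_2 pairs and drop the pair with cosine 0.898827, in line with the two outputs above.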
Find Top Addresses
The similarity matrix shows which address is more similar to others based on cosine values. If you want to store these
addresses in a separate SFrame, you can set find_top_address to True and define a top_address_limit that sets the
maximum number of top addresses to keep.
Example
vectorizer = CosineSimilarityVectorizer(find_top_address=True, top_address_limit=2)
sf = vectorizer(sf)
sf["top_address"].data
>>>
     address    cosine
0  address_2  2.934739
1  address_1  2.892718
Note that in this case, the value of the cosine column is the sum of an address's cosine values over all similarities. In the output above, this sum includes the address's similarity of 1 with itself.
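The ranking can be sketched as a row sum over the square matrix. This is a minimal illustration, assuming the score of an address is the sum of its whole row (diagonal included), which agrees with the values in the output above up to rounding; the function name top_addresses is hypothetical.

```python
# Square similarity matrix with the values from the first example.
addresses = ["address_1", "address_2", "address_3"]
square = [
    [1.000000, 0.993891, 0.898827],
    [0.993891, 1.000000, 0.940847],
    [0.898827, 0.940847, 1.000000],
]


def top_addresses(addresses, square, limit):
    """Rank addresses by the sum of their row and keep the first `limit`."""
    totals = [(address, sum(row)) for address, row in zip(addresses, square)]
    totals.sort(key=lambda item: item[1], reverse=True)
    return totals[:limit]


for address, total in top_addresses(addresses, square, limit=2):
    print(address, round(total, 6))
```

Because the sketch starts from values already rounded to six decimals, the last digit of a sum can differ slightly from the library output.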
Exclusion
Sometimes you may want to exclude some addresses from the vector just for computing the cosine similarity. One such
case is when your vector includes contract addresses that you don't want to consider for similarity. In this case,
you can set the exclusion key in group_keys equal to the key of the SFrame that contains the addresses to be
excluded. You must specify the column that holds these values in the exclusion SFrame with the exclusion_token_col
argument.
Example
sf = GroupSFrame(
    children={
        "vector": DFrame.from_raw(
            {
                "address": [
                    "address_1",
                    "address_2",
                    "address_3",
                ],
                "feature_1": [1, 12, 20],
                "feature_2": [100, 101, 40],
                "feature_3": [0.1, -2.2, -0.3],
            }
        ),
        "exclusion": DFrame.from_raw({"address": ["address_1"]}),
    }
)
vectorizer = CosineSimilarityVectorizer(exclusion_token_col="address")
sf = vectorizer(sf)
sf["cosine_sim"].data
>>>
           address_2  address_3
address_2   1.000000   0.940847
address_3   0.940847   1.000000
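Conceptually, exclusion amounts to dropping the listed addresses before the similarities are computed. The sketch below illustrates that idea in plain Python; it is an assumption about the behavior based on the output above, not the library's implementation, and the names excluded, kept, and cosine are hypothetical.

```python
import math

# Feature vectors and exclusion list from the example above.
vectors = {
    "address_1": [1, 100, 0.1],
    "address_2": [12, 101, -2.2],
    "address_3": [20, 40, -0.3],
}
excluded = {"address_1"}


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (hypothetical helper)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Drop excluded addresses first, then build the matrix from what remains.
kept = {address: feats for address, feats in vectors.items() if address not in excluded}
matrix = {
    row: {col: cosine(kept[row], kept[col]) for col in kept}
    for row in kept
}

for row, values in matrix.items():
    print(row, [round(value, 6) for value in values.values()])
```

The similarity between the remaining addresses is unchanged, since each cosine value depends only on the two vectors being compared.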