[Wranglers] Using AI_SIMILARITY function in Snowflake as a JOIN CONDITION
Ali, Saqib
docbook.xml at gmail.com
Thu Jun 12 15:25:21 UTC 2025
Recently I had to analyze a dataset to identify customers who had used
similar looking addresses (e.g. 111 Orchard St vs. 111 Orchard Street
vs. 111 Orchard St.) to gain access to multiple quantities of Gift With
Purchases (GWP). The Gift With Purchase (GWP) campaign was designed for 1 /
customer / address. So we needed to identify all the customers who used
variations of their address to gain access to multiple quantities of the
GWP.
I ended up trying the new AI_SIMILARITY function in Snowflake for this. I
used the AI_SIMILARITY as a JOIN PREDICATE.
Here is a sample dataset.
ID,FIRST_NAME,ADDRESS_STREET
1,Maryam,111 Orchard St
2,Maryam,111 Orchard Street
3,Maryam,111 Orchard St.
4,maryem,111 Orchard Way
And here is the query I used:
select distinct *
from customers as a
inner join customers as b
on ai_similarity(a.address_street, b.address_street) > 0.9 -- threshold
can be adjusted
and ai_similarity(a.first_name, b.first_name) > .9 -- threshold can be
adjusted
and a.id <> b.id;
And here is the output from the sample dataset:
ID,FIRST_NAME,ADDRESS_STREET,ID,FIRST_NAME,ADDRESS_STREET
3,Maryam,111 Orchard St.,1,Maryam,111 Orchard St
2,Maryam,111 Orchard Street,1,Maryam,111 Orchard St
1,Maryam,111 Orchard St,3,Maryam,111 Orchard St.
3,Maryam,111 Orchard St.,2,Maryam,111 Orchard Street
2,Maryam,111 Orchard Street,3,Maryam,111 Orchard St.
1,Maryam,111 Orchard St,2,Maryam,111 Orchard Street
While this query worked, it was extremely slow. In fact it was timing out
on the full dataset even with the largest Snowflake Warehouse. So I ended
up segmenting the Customer Records by State to get the query to work on
small subsets of data.
I was pleasantly surprised how well this AI_SIMILARITY function worked for
identifying similar addresses. I am not sure if Snowflake can work on
optimizing this function.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://analyticsengineering.net/pipermail/wranglers/attachments/20250612/afc03811/attachment.htm>
More information about the Wranglers
mailing list