Deduplicating 40k SNAP news articles with MinHash LSH
While working on topic modeling of news coverage of the Supplemental Nutrition Assistance Program (SNAP) since summer 2024, I needed to identify and tag duplicate news articles. The articles come from a Google Alerts RSS feed, which filters the news by keyword. In the future I also want to group duplicates together on the SNAP News site I put together to surface these articles. After reading these articles for around a year, my impression is that duplicates and near-duplicates are rare overall but noticeable. This post quantifies that impression, explores why duplicates occur, and discusses some implications for topic modeling. ...