While working on topic modeling of news coverage of the Supplemental Nutrition Assistance Program (SNAP), I needed to identify and tag duplicate news articles from a Google Alerts RSS feed that has been filtering the news through keywords since summer 2024.
In the future, I also want to group these articles on the SNAP News site I put together to surface them.
After reading it for around a year, I consider duplicates / near duplicates to be an overall rare, but noticeable phenomenon. This post quantifies that feeling, explores why it occurs, and discusses some implications for topic modeling.
The code for this analysis (and almost all of the text) is available in a GitHub repository here (note that the data visualizations in this post are exploratory and not the most beautiful, but they are not the focus here).
Why do duplicates appear in the dataset at all?
- News outlets will frequently re-publish the same article across different websites. For example, AP News might have an article that surfaces in 2 different news aggregators that are both captured in the Google News Alerts feed. These make their way into the dataset as either exact or near-exact duplicates.
- The most common case is that the source differs, but the content is the same, likely indicating it is syndicated content; for example:
- Oregon AG Dan Rayfield sues Trump over latest changes to SNAP benefits - KLCC
- Oregon AG Dan Rayfield sues Trump over latest changes to SNAP benefits - OPB
- If I want to compare the popularity of a topic, the fact that an article has been republished across multiple different outlets may be a valuable indicator that the topic is considered important relative to other, unduplicated articles.
- However, there could be a standing arrangement among specific outlets to always republish articles, i.e., wire service content; e.g., all articles written by Bob at AP News are always published by both KLCC and OPB.
- The Mining of Massive Datasets book references this explicitly, indicating that Google News tries to deduplicate as well; I’m assuming the News Alerts may also do this? And some just slip through?
- I am assuming that these types of articles are filtered, e.g., there is a person at the receiving outlet who picks specific articles to publish from everything the wire service provides. Given that, I would still want to retain them in the model, since republication indicates how “popular” an article is.
- Another case has been the same title, but slightly different indexed description; for example:
- (title) Alaska didn’t use $5M set aside to fund SNAP during the shutdown, even though benefits were late
- (description) ANCHORAGE, Alaska (KTUU) - Alaskans who qualify for the Supplemental Nutrition Assistance Program … SNAP is a federal food assistance program that is … KTUU
- (description) Alaskans who qualify for the Supplemental Nutrition Assistance Program (SNAP) received half of their benefits nearly a week late as a result of … KYUK
- (title) Alaska didn’t use $5M set aside to fund SNAP during the shutdown, even though benefits were late
- Extremely similar content may also be published, but slightly adapted; for example, a news station republishing a Governor’s press release:
- (title) Gov. Landry Announces Louisiana SNAP Recipients Now Receiving Full November Benefits
- (title) Governor reassures that Louisiana SNAP recipients now receiving full November benefits
- The same content may be duplicated with different indexed descriptions; for example, an indexed Facebook post, which got duplicated (perhaps because of comments?):
- (title) NY AG Letitia James is ORDERING the USDA to change its guidelines for SNAP eligibility, saying
- (description) “… Supplemental Nutrition Assistance Program (SNAP). The measure is … SNAP, or food stamps, as a lifeline. The federal government must do its …”
- (description) “It was found dead people in one state were receiving SNAP benefits in another. 10 hrs.”
- (title) NY AG Letitia James is ORDERING the USDA to change its guidelines for SNAP eligibility, saying
- Multi-media articles may also be indexed multiple times:
- "Some Northeast Ohio families are still waiting for their SNAP benefits" has a written version and an accompanying YouTube video.
- Automated scheduled content may use nearly the same text each time it’s published with injected parameters.
- (title) Snap Inc. stock underperforms Friday when compared to competitors - MarketWatch
- (description from publish date 7/19/25) “falling 0.93% to 40,287.53. Snap Inc. closed $3.50 below its 52-week high ($17.90), which the company reached on December 18th. The stock …”
- (description from publish date 7/27/25) “rising 1.64% to 40,589.34. This was the stock's fourth consecutive day of losses. Snap Inc. closed $4.57 short of its 52-week high ($17.90), which …”
How many, if any, exact duplicates exist in the dataset?
From the 41597 articles in the dataset, grouping on the title of each article returns 1038 titles that appear more than once, or 2.60%. Grouping on the description returns 646 descriptions that appear more than once, or 1.59%. Concatenating the title and the description, then grouping on that combined text, returns 205 duplicated values, or 0.50%.
Overall, 3588 (8.63%) items in the dataset have either the title, description, or both duplicated in the dataset.

Generally, the matches on title only dominate each category, which likely reflects the influence of syndicated content (or similar practices in headline creation, perhaps).
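A rough sketch of how these counts can be computed with pandas (the DataFrame and the `title` / `description` column names are assumptions on my part, not necessarily what the notebook uses):

```python
import pandas as pd

# Assumed column names; the actual notebook may differ.
df = pd.read_csv("_data/articles.csv")
df["text"] = df["title"].fillna("") + " " + df["description"].fillna("")

for col in ("title", "description", "text"):
    counts = df[col].value_counts()
    duplicated_values = counts[counts > 1]
    rows_affected = df[col].isin(duplicated_values.index).sum()
    print(
        f"{col}: {len(duplicated_values)} values appear more than once, "
        f"covering {rows_affected} rows ({rows_affected / len(df):.2%})"
    )
```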
After a brief manual review of the sets, I chose to:
- drop articles of any match type (`title`, `description`, or both) when published on the same day on the same site. These largely appear to be indexing updates.
- drop the latest (by date) duplicated item when the `text` and `source` are the same. These articles typically look like reindexed content that doesn’t provide any additional value.
I do retain duplicate values when multiple sites are publishing the same content, as this indicates to some extent the topic might be of more interest to more people. I don’t think this is likely to hurt the topic model. I also (from reviewing) think that some of this automated content will be “off-topic”, and filtered into its own group within the topic model; for example, automated sports headlines.
Matching on duplicated titles, descriptions, or full text, I dropped 581 articles from the same source published on the same date, or 1.40% of articles, with 41016 articles remaining. Matching on the same title and description, I dropped 93 articles from the same source published on different dates, or 0.23% of articles, with 40923 articles remaining.
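In pandas terms, the two rules look roughly like the following sketch (column names such as `published_date` are assumptions, and this is not the notebook's exact code):

```python
# Rule 1: same title, description, or combined text, published on the same
# day by the same source -> keep only the first row of each group.
for col in ("title", "description", "text"):
    df = df[~df.duplicated(subset=[col, "source", "published_date"], keep="first")]

# Rule 2: identical text from the same source on different dates -> keep the
# earliest publication and drop the later, re-indexed copies.
df = df.sort_values("published_date")
df = df[~df.duplicated(subset=["text", "source"], keep="first")]
```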
Applying MinHash LSH for near duplicates
I used the MinHash LSH implementation from the datasketch Python library, with guidance from Mining of Massive Datasets and a blog post from Nelson Elhage from last year that implemented this. While ~37k articles is not a large dataset to deduplicate by modern standards (compared to recent work deduplicating the datasets LLMs are trained on), computing the Jaccard similarity directly for each potential pair would still mean roughly 37k² / 2, or about 680 million, comparisons.
One consideration for MinHash (and for topic models, looking further out) is that the comparison between articles A and B is done over a fixed number of hash permutations, where each permutation keeps only the minimum value over an article's set of hashed shingles. Note that the shingles overlap, such that the string "foobar", with a shingle length of 2, becomes ['fo','oo','ob','ba','ar'].
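For concreteness, a tiny character-shingling helper (the length-5 shingles used later follow the same pattern):

```python
def shingles(text: str, k: int = 2) -> list[str]:
    """Overlapping character shingles of length k."""
    return [text[i : i + k] for i in range(len(text) - k + 1)]

shingles("foobar")  # ['fo', 'oo', 'ob', 'ba', 'ar']
```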
The more permutations of each item, the more samples of the minimum value you have from each article. The more samples, the less variance in the resulting estimate.
For longer articles, this is great, since comparing 256 values per article pair is better than comparing 2 articles with potentially thousands of words each. But I only have the title and description of each article, meaning the number of shingles is almost always less than 256, so a direct Jaccard similarity between 2 articles takes fewer operations than using the hashed version (it may still be faster to compare the hashed versions directly, due to some optimizations the library implemented, but more on that later).
So MinHash, by itself, isn’t the key to speeding this up. This is not something that I realized when I started implementing the MinHash algorithm.
Instead, most of the value comes from the locality-sensitive hashing (LSH) extension, which avoids most comparisons entirely: articles are hashed to buckets, and article-to-article comparisons are only performed for articles that land in the same bucket.
This is similar to using blocking rules in fuzzy matching projects I have done with Splink, another Python library, where the number of comparisons is limited based on hand-picked rules that only allow matches within defined groups, e.g., only check duplicate locations when they have the same zip code.
There are some alternative approaches that I’d like to review, including SimHash and FAISS. Google appears to use MinHash LSH for news, so I figured it would be a good starting point. A 2014 paper linked from the SimHash Wikipedia page suggests MinHash LSH may be the better choice on binary data, but it is highly technical and I need more time to understand it. A StackOverflow answer suggests speed and memory might be the deciding factors, but I’ll have to implement it to understand. FAISS, as far as I can tell, seems primarily designed for much, much larger datasets, so I skipped it for now.

The original code I wrote, following this GitHub discussion and the Mining of Massive Datasets book, ran slowly (about 15 minutes across all articles) using ngrams (split on spaces), 128 permutations, and the default hash function provided by the datasketch library.
Note that the MMD book does suggest an alternative shingling method for identifying similar news articles, based on splitting on stop words and including the following two words. For this exercise, I don’t consider this necessary: the book implies that in its case the documents include the surrounding text of the webpage (advertisements, etc.), and the goal was to prioritize the article text. Since I have no text other than the title and leading description, that concern doesn’t apply here.
Based on review in the Data Preparation notebook, I’m not concerned about skewed scoring due to sets of wildly different lengths, though it was something I explored. Overall, the dataset is limited in length, since it includes only a description rather than the full article text.
After properly reading the documentation and reviewing the code again, I switched to proper shingles of length 5 (as suggested by the Mining of Massive Datasets book in 3.2.2), switched to 256 permutations, implemented MinHash.bulk rather than instantiating a MinHash for each set (saving the overhead of creating a new object), and used a hash function from farmhash to calculate the hashes of each set faster.
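A rough sketch of what the revised version looks like; the variable names, the `docs` list of concatenated title + description strings, and the 0.8 threshold are my assumptions here, not the repository's exact code. datasketch expects a `hashfunc` that maps bytes to an integer, and I believe pyfarmhash's `hash32` takes a string, hence the decode:

```python
import farmhash  # pyfarmhash
from datasketch import MinHash, MinHashLSH

NUM_PERM = 256

def shingle_set(text: str, k: int = 5) -> set[bytes]:
    # Length-5 character shingles, encoded as bytes for datasketch
    return {text[i : i + k].encode("utf-8") for i in range(len(text) - k + 1)}

def hash32(data: bytes) -> int:
    # datasketch passes bytes; pyfarmhash's hash32 expects a string (I think)
    return farmhash.hash32(data.decode("utf-8"))

shingle_sets = [shingle_set(doc) for doc in docs]

# MinHash.bulk avoids the per-document object creation overhead
minhashes = MinHash.bulk(shingle_sets, num_perm=NUM_PERM, hashfunc=hash32)

# LSH buckets the articles; only articles sharing a bucket get compared
lsh = MinHashLSH(threshold=0.8, num_perm=NUM_PERM)
for i, m in enumerate(minhashes):
    lsh.insert(str(i), m)

candidate_pairs = set()
for i, m in enumerate(minhashes):
    for key in lsh.query(m):
        j = int(key)
        if j != i:
            candidate_pairs.add(tuple(sorted((i, j))))
```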
These changes meant the explore_thresholds function ran in about 5 minutes, even with the increased permutations.
I also realized that I was pulling the suggested pairs and then comparing the texts as sets, rather than comparing the hashed versions with the built-in jaccard function. But given that the mean number of shingles per set is around 170, this shouldn’t change much; on testing, a simple Jaccard comparison function added perhaps a minute to the total running time.
I was a bit confused by this at first, but reviewing it with an LLM suggests that the MinHash object might be quicker to query (I’m not qualified enough to assess this), that comparisons over integer hashes are quicker than over the sets themselves, which are stored as bytes, and that directly comparing the sets requires recomputing some of the work already done when the MinHash values were originally calculated.
Side note: this is one example of the types of questions that I struggle with asking LLMs. I have had enough cases of “the LLM just said X is absolutely right, but when I pointed out a problem, it responded saying that Y is entirely correct” to not feel confident about the answers it gives when I inquire about something I’m not familiar with.
Practically, with only an additional ~30 seconds, it still makes sense to directly compare the sets, since this eliminates approximation errors for comparisons right at the boundary.
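The exact comparison is just a small set operation; a sketch, reusing the shingle sets from the pipeline above (`minhashes[i].jaccard(minhashes[j])` is the datasketch estimate):

```python
def jaccard(a: set, b: set) -> float:
    """Exact Jaccard similarity between two shingle sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

for i, j in candidate_pairs:
    estimated = minhashes[i].jaccard(minhashes[j])      # MinHash approximation
    exact = jaccard(shingle_sets[i], shingle_sets[j])    # exact value
```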

Plotting the results, it all checks out. There isn’t much difference between the MinHash and direct comparisons, which is reassuring. In a future post, I’d like to explore the differences a bit further. Looking at some of the matches:
- Title: Kentuckys Cornbread Hemp First CBD Company In State With USDA Organic Certified Products
- Description: Learn More Author and journalist Jim Higdon told the story of a Kentucky based criminal marijuana syndicate in his 2013 book
matched to
- Title: Kentuckys Cornbread Hemp First CBD Company In State With USDA Organic Certified Products
- Description: Learn More Author and journalist Jim Higdon told the story of a Kentucky based criminal marijuana syndicate in his 2013 book
- Title: Florida SNAP Food Restriction Waiver USDA Food and Nutrition Service
- Description: Florida SNAP Food Restriction Waiver Breadcrumb Home Supplemental Nutrition Assistance Program SNAP SNAP Rule Waivers SNAP Food Restriction
matched to
- Title: Texas SNAP Food Restriction Waiver USDA Food and Nutrition Service
- Description: Texas SNAP Food Restriction Waiver Breadcrumb Home Supplemental Nutrition Assistance Program SNAP SNAP Rule Waivers SNAP Food Restriction
- Title: Owen Wolff Austin snap LAFCs 6 game win streak with 1 0 victory kare11 com
- Description: Owen Wolff scored in a goal in the 83rd minute Brad Stuver had two saves and Austin FC beat Los Angeles FC 1 0 to snap LAFCs six game win streak
matched to
- Title: Owen Wolff Austin snap LAFCs 6 game win streak with 1 0 victory CBS Los Angeles
- Description: Owen Wolff scored in a goal in the 83rd minute Brad Stuver had two saves and Austin FC beat Los Angeles FC 1 0 to snap LAFCs six game win
And relative to the exact string matching, how many of the suggested duplicates from MinHash are new?
Summary of Duplicate Pairs by Thresholds:
- Threshold 0.75: 64 exact, 998 partial duplicates (3.34%).
- Overlap: 34.24% also have exact title match, 40.89% also have exact description match
- Threshold 0.8: 64 exact, 649 partial duplicates (2.29%).
- Overlap: 40.41% also have exact title match, 49.04% also have exact description match
- Threshold 0.85: 64 exact, 409 partial duplicates (1.55%).
- Overlap: 48.35% also have exact title match, 56.69% also have exact description match
- Threshold 0.9: 64 exact, 224 partial duplicates (0.88%).
- Overlap: 64.9% also have exact title match, 64.35% also have exact description match
- Threshold 0.95: 64 exact, 86 partial duplicates (0.39%).
- Overlap: 98.12% also have exact title match, 76.25% also have exact description match
At the .95 similarity threshold, exact string matching on the title caught 98.12% of the 86 duplicates MinHash suggested. At .75, MinHash surfaces 998 partial duplicates, most of which are not covered by exact string matching.
Choosing a threshold & validating matches
However, it’s tough to decide, in practice, what MinHash threshold to use for assigning the duplicate label, without just trying a bunch of values and reviewing the results.
A blog post from the person who created Splink, another fuzzy matching Python library I use for entity matching at work, recommends scenario-based quality testing. This is based on grouping the comparison vectors and testing examples within each group; in this case, one scenario would be to test potential matches only where the source domain matches but the publish date does not. I think this initial deduplication can help guide new rules for when I group articles on the website itself, and give me a sense of the potential scale and type of duplication to look out for.
Practically, I can build a validation set, manually review it, and use it as a test benchmark for precision and recall at different thresholds.
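Something like the following is what I have in mind; `labeled_pairs` (a manual review of candidate pairs) and `pairs_by_threshold` don't exist yet, so this is purely a sketch:

```python
def precision_recall(predicted: set, labeled_pairs: dict) -> tuple[float, float]:
    """labeled_pairs maps (i, j) -> True/False from manual review."""
    true_dups = {pair for pair, is_dup in labeled_pairs.items() if is_dup}
    scored = predicted & set(labeled_pairs)  # only evaluate reviewed pairs
    tp = len(scored & true_dups)
    precision = tp / len(scored) if scored else 0.0
    recall = tp / len(true_dups) if true_dups else 0.0
    return precision, recall

for threshold, pairs in sorted(pairs_by_threshold.items()):
    p, r = precision_recall(pairs, labeled_pairs)
    print(f"{threshold}: precision={p:.2f}, recall={r:.2f}")
```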
However, I’d like to build the topic model prior to doing this and filter out “off-topic” articles first. Based on reviewing the pairs, many of the duplicated articles are automated content, like articles with very high similarity scores that just reflect different dates.
Take sports scores for week 6 versus week 7. If I mark these as not-duplicates, which is correct, then any evaluation metric will be driven down, since the similarity score will assuredly flag them as duplicates, and this is despite the fact that I’m going to filter out sports-related content anyway.
One option I have considered is “feature engineering”, or switching to Splink. I could parse out any named month, e.g., April report vs May report, and include this as a factor to weight the similarity score by. The same goes for any U.S. state mentioned, and the publish date as well.
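The month part of this is a simple regex; an illustrative sketch (state names would work the same way, with a list of U.S. state names):

```python
import re

MONTHS = ("january", "february", "march", "april", "may", "june", "july",
          "august", "september", "october", "november", "december")
MONTH_RE = re.compile(r"\b(" + "|".join(MONTHS) + r")\b", re.IGNORECASE)

def extract_months(text: str) -> set[str]:
    return {m.lower() for m in MONTH_RE.findall(text)}

extract_months("April report")  # {'april'}
extract_months("May Report")    # {'may'}
# Disagreeing month sets could then down-weight an otherwise high similarity.
```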
So, since I am only concerned about a specific topic of interest, the workflow will be: topic model first, then deduplicate. This also works here because the RSS feed and the GitHub Action already have some duplicate prevention, so I’m not concerned about being overwhelmed by duplicated content.
Hopefully, with a more targeted validation dataset, I can explore more options to deduplicate and merge using the additional attributes, and have better automated metrics that reflect meaningful improvements and help tune a threshold. But we’ll see, I suppose. I wrote up a separate script to export a validation dataset that I will keep for later.
Exploring threshold differences
One thing I wanted to check: since MinHash is probabilistic, pairs that appear in the results at a higher similarity threshold may not appear in the results at a lower threshold. E.g., a match found at the 0.9 threshold might not be found among the 0.8 threshold pairs. Checking to see how often this happens:
Comparing threshold 0.8 vs 0.75
- Is 0.8 a subset of 0.75? False
- Pairs in 0.8 but NOT in 0.75: 8
- Sample differences:
[(9957, 24592), (1925, 1944), (25334, 25339)]
Comparing threshold 0.85 vs 0.8
- Is 0.85 a subset of 0.8? False
- Pairs in 0.85 but NOT in 0.8: 3
- Sample differences:
[(13665, 13693), (7411, 7413), (3154, 5367)]
Comparing threshold 0.9 vs 0.85
- Is 0.9 a subset of 0.85? True
- Pairs in 0.9 but NOT in 0.85: 0
Comparing threshold 0.95 vs 0.9
- Is 0.95 a subset of 0.9? True
- Pairs in 0.95 but NOT in 0.9: 0
Note that the results of the MinHash function, and therefore the matches themselves, will vary from run to run. However, with 256 permutations versus 128, we get closer to each successive threshold being a subset of the one below it.
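The check itself is just set arithmetic over the candidate pairs at each threshold; a rough sketch, using the `pairs_by_threshold` structure assumed in the earlier sketches:

```python
# pairs_by_threshold: {threshold: set of (i, j) index tuples}
thresholds = sorted(pairs_by_threshold)  # ascending: 0.75, 0.8, ...
for lower, higher in zip(thresholds, thresholds[1:]):
    missing = pairs_by_threshold[higher] - pairs_by_threshold[lower]
    print(f"Comparing threshold {higher} vs {lower}")
    print(f"- Is {higher} a subset of {lower}? {not missing}")
    print(f"- Pairs in {higher} but NOT in {lower}: {len(missing)}")
```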
Do duplicates occur regularly over time? What is the time difference between published duplicates?
For each set of duplicates, check the following scenarios:
- article published 1+ days later by the same source
  - how long in between
- article published 1+ days later by a different source
  - how long in between
- article published the same day by the same source
- article published the same day by a different source
Also check overlap across sources (the website domain where the article is found).
- Which article appears the most often?
- Which publishers have the most ‘overlap’, e.g., publishers with the most duplicates between them?
One thing I was curious about is whether any trends emerge by threshold. E.g., it would make sense to me that the average time difference between publications at the .75 similarity threshold would be higher than at the .95 threshold: .95-threshold articles are perhaps the exact same republished content, while a .75 similarity could just reflect quite different articles posted within a few months of one another.
So, I iterate over the thresholds in descending order, removing already seen articles as I go, so I am really only looking at the articles included in that band.
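Roughly, the banding looks like this sketch: walk the thresholds from highest to lowest and keep only pairs whose articles haven't already been claimed by a higher band (again assuming the `pairs_by_threshold` structure from earlier):

```python
seen_articles: set[int] = set()
bands: dict[float, set[tuple[int, int]]] = {}
for threshold in sorted(pairs_by_threshold, reverse=True):  # 0.95, 0.9, ...
    band = {
        (i, j)
        for i, j in pairs_by_threshold[threshold]
        if i not in seen_articles and j not in seen_articles
    }
    bands[threshold] = band
    seen_articles.update(idx for pair in band for idx in pair)
```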
Note that I don’t know if the publish time from the RSS feed is reliable, but will have to follow up another time.

These are exploratory visualizations, so not going to take too much time to polish them. Looking at the plots, I see:
- Not too much variation by threshold.
- Clearly skewed dataset for the thresholds.
- My bet is that the very low median time difference for articles published through different sources is indicative of syndicated content.
- And my guess is that, for the same source, it might be regular updates on the same topic, which would explain the 3-to-4-week gap in the median.
- I think automated publications explain why, even at higher thresholds, some articles are published far apart, e.g., regular updates being compared to many past editions (last year’s sports scores to this year’s, for example).
- Given that the largest median difference is just under 60 days, I want to look at some of the outliers that drag the mean up. (Note that by removing the `density_norm="count"` argument in the `sns.violinplot` call you can see this better.)
What are some of the matches over 60 days?
- 0.95 threshold: 9/40 pairs (22.50%) published 60+ days apart (excluding same-day)
- 0.90 threshold: 32/98 pairs (32.65%) published 60+ days apart (excluding same-day)
- 0.85 threshold: 52/132 pairs (39.39%) published 60+ days apart (excluding same-day)
- 0.80 threshold: 60/183 pairs (32.79%) published 60+ days apart (excluding same-day)
- 0.75 threshold: 93/235 pairs (39.57%) published 60+ days apart (excluding same-day)
Across thresholds (excluding same-day publications),
- 246/688 pairs (35.76%) published 60+ days apart
- 442/688 pairs (64.24%) published within 60 days
Some examples:
- Title: Baby Floral Snap Up Cotton Sleep Play Pajamas Carters
- Description: Baby Floral Snap Up Cotton Sleep Play Pajamas from carters com Shop clothing accessories from a trusted name in kids toddlers and baby
matched to
- Title: Baseball Snap Up Cotton Sleep Play Pajamas Carters
- Description: Baseball Snap Up Cotton Sleep Play Pajamas from carters com Shop clothing accessories from a trusted name in kids toddlers
- Title: IGA House Bill 1486 Use of SNAP benefits IN gov
- Description: Digest Prohibits recipients of Supplemental Nutrition Assistance Program SNAP benefits from using SNAP benefits to purchase accessory foods
matched to
- Title: House Bill 1263 Use of SNAP benefits IN gov
- Description: Digest Prohibits Supplemental Nutrition Assistance Program SNAP benefits recipients from using SNAP benefits to purchase accessory foods
- Title: Snap Count Observations Transactions to Make for Week 4 Fantasy Football
- Description: After looking after every teams snap count and usage rates here are some transactions fantasy managers can make for Week 4
matched to
- Title: Snap Count Observations Transactions to Make for Week 6 Fantasy Football
- Description: After looking after every teams snap count and usage rates here are some transactions fantasy managers can make for Week 6
Helpful! Cycling through these results a few times shows
- Lots of weekly football scores being compared to other weeks,
- some new policy updates or regular forecasts, and
- re-published advertisements
While this isn’t full coverage, it does indicate that an all-time duplicates comparison might be misleading. I don’t want to lose yearly USDA forecasts, for example. This can help inform a blocking rule in the future that limits comparisons to a time-based interval, i.e., compare each article only to other articles published within ~50 days.
Note that the topic model will help me filter out “off-topic” articles (Marvel Snap news) in the future, so I will not worry about that now.
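A sketch of that time-based blocking rule, filtering the candidate pairs from earlier (assuming `published_date` is a datetime column and the pair indices line up with the DataFrame index):

```python
WINDOW_DAYS = 50

def within_window(i: int, j: int) -> bool:
    gap = abs((df.loc[i, "published_date"] - df.loc[j, "published_date"]).days)
    return gap <= WINDOW_DAYS

candidate_pairs = {pair for pair in candidate_pairs if within_window(*pair)}
```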
Duplicates by Source
Last thing is looking at the relationships between the sources of the duplicates. For this, I use duplicates from both the exact title and/or description matching alongside partial duplicates at the .75 threshold.
1251 total sources are involved in duplicates. The sources with the most duplicates:
| source | duplicate_count |
|---|---|
| waukonstandard.com | 870 |
| defensecommunities.org | 274 |
| usda.gov | 232 |
| recorderonline.com | 141 |
| fsa.usda.gov | 138 |
| usda.library.cornell.edu | 128 |
| marketwatch.com | 128 |
| marvelsnapzone.com | 124 |
| thefantasyfootballers.com | 110 |
| x.com | 91 |
| fsis.usda.gov | 91 |
And across sources, for 1724 total source pairs (note this doesn’t reflect an “origin”, exactly):
| source_a | source_b | shared_duplicates |
|---|---|---|
| fsa.usda.gov | glasgowcourier.com | 13 |
| usda-ec2-prod.library.cornell.edu | usda.library.cornell.edu | 10 |
| usda-production.library.cornell.edu | usda.library.cornell.edu | 10 |
| fsa.usda.gov | oklahomafarmreport.com | 10 |
| bladenonline.com | fsa.usda.gov | 7 |
| greatamericancrop.com | usda.gov | 5 |
| foxsports.com | wtop.com | 5 |
| californiaagnet.com | fsa.usda.gov | 5 |
| finance.yahoo.com | stockstory.org | 4 |
| 1010wcsi.com | hoosieragtoday.com | 4 |
| shoprite.com | thefreshgrocer.com | 4 |
| cbssports.com | ksl.com | 4 |
| cbssports.com | foxsports.com | 4 |
| foxsports.com | ksl.com | 4 |
| priceritemarketplace.com | thefreshgrocer.com | 4 |
| seattlepi.com | wtop.com | 4 |
The first time I tried this, I discovered that many of the “different source” assignments reflected poor data preparation on domains; for example, “usda.com” => “www.usda.com”. So I had to go back to the data cleaning notebook and fix this. The “cornell.edu” domain with its subdomains is unfortunate, but not worth writing something special for at this point. The Waukon Standard is obviously noticeable: each week it publishes “Whats Up at the USDA Office Waukon Standard”; similarly, “defensecommunities.org” publishes “Snap of the Week Association of Defense Communities”.
While the Association of Defense Communities website is off-topic, this type of automated content is tricky to handle, since it will invariably be picked up by similarity or exact string matching.
To wrap this up, an exploratory network chart showing the overlap between sources.

Taking “ksl.com” as an example, the network chart, along with the other source tables above, indicates that “ksl.com” had articles in common with 54 other sources across 70 suggested duplicates, the top sharers being “cbssports.com” and “foxsports.com”.
More websites than I expected had an article in common with only 1 other website, which is something I’d like to explore further. For example, I notice more custom subdomains, like “facebook.com” and “m.facebook.com”. I’m wary of removing too many of these, however; for example, “delauro.house.gov” and “courtney.house.gov” reflect a meaningful difference.
Additionally, you can see the flaws in the similarity matching a bit more intuitively as well. For example, a small cluster around “michigan.gov”, “oklahoma.gov”, and local school websites shows how a long description and short title make for misleading comparisons:

- Title: Child Nutrition Programs CNP Bulletin No 2 State of Michigan
- Description: In accordance with federal civil rights law and U S Department of Agriculture USDA civil rights regulations and policies this institution is
matched to
- Title: DEIB The Family Nurturing Center
- Description: In accordance with federal civil rights law and U S Department of Agriculture USDA civil rights regulations and policies this institution is
The News Alerts feed can often pick up updates to these types of website announcements, which are not necessarily “news”, meaning boilerplate text can be common. I ran into this same issue with my analysis of the 2025 government website shutdown banners. If this isn’t caught by the topic model, which will do a better job of matching semantic meaning, then I may need to go back and resolve it in the data cleaning scripts.
How would they affect the resulting modeled topics? Does it lead to worse topics?
Truthfully, I’m not quite sure yet. I’ve tried a few iterations of topic models, and I don’t think this should have a large effect either way given the limited number.
However, now that I have the indices of the duplicates, I can follow up with more questions. For example, are there certain topics that see more duplicates than others?
After exploring the topic models created, I’ll come back and update this, or at least link to another post.
Notes
How can I merge these in the SNAP News site?
- By merge, I mean that it would be nice for an article that is published in, for example, 3 sources to appear once with just a small (3) next to it, with the sources in little bubbles, rather than having the same title and description repeated 3 times.
- This feels like just a mini-topic model in itself over a very small time period and dataset.
- Or, I just compute the Jaccard similarity directly for each article within a time frame of N days, where N is set by my review of duplicates in the current dataset (e.g., 98% of proposed duplicates at the 99% threshold are published within 5 days of one another). There would be both same-day duplicates and “previously published” duplicates from prior days. Based on this analysis, this feels like the right approach to start with (a rough sketch follows below).
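A rough sketch of that idea, reusing the `shingles` and `jaccard` helpers from earlier; the N-day window, the 0.9 cutoff, and the article/group field names are placeholders, not a final design:

```python
from datetime import timedelta

N_DAYS = 5        # window, to be set from reviewing current duplicates
THRESHOLD = 0.9   # placeholder similarity cutoff

groups: list[dict] = []  # each group: {"shingles", "published", "sources", "title"}

def assign_to_group(article: dict) -> None:
    """Attach an incoming article to an existing group or start a new one."""
    s = set(shingles(article["title"] + " " + article["description"], k=5))
    for group in groups:
        close_in_time = abs(article["published"] - group["published"]) <= timedelta(days=N_DAYS)
        if close_in_time and jaccard(s, group["shingles"]) >= THRESHOLD:
            group["sources"].append(article["source"])
            return
    groups.append({"shingles": s, "published": article["published"],
                   "sources": [article["source"]], "title": article["title"]})
```

The site would then render one entry per group, with `len(group["sources"])` as the little count bubble.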
Google Alerts
I’m not sure how many duplicates (or near duplicates) are filtered out before I pull the RSS feed from Google Alerts. I haven’t found any clear documentation, but a few pages suggest that Google News does try to identify an original article. So it is possible that I am only dealing with the ’leftovers’ that the Alerts system missed.
Avoiding duplicated feed submissions
The GitHub Action workflow also filters out complete row duplicates across all attributes.
---
```yaml
- name: Deduplicate CSV
  run: |
    awk -F, '!seen[$0]++' _data/articles.csv > deduped_articles.csv
    mv deduped_articles.csv _data/articles.csv
```
The awk command is an unpleasantly terse one-liner from an LLM: it iterates over each line ($0), incrementing a counter for that line each time it is seen, and relies on 0 being false and any positive value being true.
Negating it with the exclamation point means the first occurrence (count 0) evaluates to true and is printed, while the second and later occurrences (1, 2, …) evaluate to false and are skipped.
The second line overwrites the existing data with the deduplicated data. I don’t track how often values are filtered out from this step, but I consider these indexing errors.
Data Preparation
In the code repository, this is the second notebook, after the data cleaning notebook. I don’t really go over the data cleaning notebook; it’s pretty straightforward, just cleaning up HTML and parsing fragments. While doing the deduplication analysis, I did have to go back and change a couple of things, for example, parsing the source where it was NaN.