Using LINQ to list all duplicates

posted on 11 March 2011 in programming with 0 Comments

There are plenty of examples of how to find duplicates using LINQ’s GroupBy method, but they usually use a projection to return a new object, like this:

// The projection yields an anonymous type, so the result goes into a new variable.
var duplicateEmails = (from s in _filteredSubmissions
                       group s by s.Email
                       into g
                       where g.Count() > 1
                       select new { Email = g.Key, DuplicateCount = g.Count() });

This returns only the duplicated email addresses and how many times each one occurs. But what if you want to list all the original items that are duplicates? If you just ‘select g’ in the example above you’ll end up with an IEnumerable<IGrouping<string, Submission>>, a collection of collections, which is not what we want. This is where SelectMany() comes in: it flattens the collection of collections back down to a single collection.

_filteredSubmissions = (from s in _filteredSubmissions
                        group s by s.Email
                        into g
                        where g.Count() > 1
                        select g).SelectMany(g => g);
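
For comparison, the same query can be written in method syntax. This is only a minimal sketch: the Submission class, the DuplicateEmailDemo helper and the sample data are hypothetical stand-ins, since the post doesn't show the real types.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the post's Submission class.
public class Submission
{
    public string FirstName { get; set; }
    public string LastName { get; set; }
    public string Email { get; set; }
}

public static class DuplicateEmailDemo
{
    // Returns every submission whose Email occurs more than once.
    public static List<Submission> FindDuplicateEmails(IEnumerable<Submission> submissions)
    {
        return submissions
            .GroupBy(s => s.Email)        // one group per distinct email
            .Where(g => g.Count() > 1)    // keep only groups with duplicates
            .SelectMany(g => g)           // flatten the groups back into submissions
            .ToList();
    }

    public static void Main()
    {
        // Made-up sample data for illustration.
        var submissions = new List<Submission>
        {
            new Submission { FirstName = "Ann",  LastName = "Lee", Email = "a@example.com" },
            new Submission { FirstName = "Bob",  LastName = "Ray", Email = "b@example.com" },
            new Submission { FirstName = "Anna", LastName = "Lee", Email = "a@example.com" },
        };

        // Lists the two submissions that share a@example.com.
        foreach (var s in FindDuplicateEmails(submissions))
            Console.WriteLine($"{s.FirstName} {s.LastName} <{s.Email}>");
    }
}
```

Whether you prefer query syntax or method syntax here is mostly a matter of taste; both group first and flatten afterwards.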

Next, I want to find Submissions that have the same first and last names. To group by more than one field, I use an anonymous type as the key.

_filteredSubmissions = (from s in _filteredSubmissions
                        group s by new { s.FirstName, s.LastName }
                        into g
                        where g.Count() > 1
                        select g).SelectMany(g => g);
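
The reason an anonymous type works as a grouping key is that the compiler gives it value-based Equals and GetHashCode, so two keys with equal property values land in the same group. Here is a small sketch of that behaviour; the AnonymousKeyDemo class and the sample names are made up for illustration.

```csharp
using System;

public static class AnonymousKeyDemo
{
    // Builds two anonymous keys shaped like the one in the query and compares them.
    public static bool KeysMatch(string first1, string last1, string first2, string last2)
    {
        var a = new { FirstName = first1, LastName = last1 };
        var b = new { FirstName = first2, LastName = last2 };

        // Equality is based on the property values, not on object identity.
        return a.Equals(b) && a.GetHashCode() == b.GetHashCode();
    }

    public static void Main()
    {
        Console.WriteLine(KeysMatch("Ann", "Lee", "Ann", "Lee")); // True
        Console.WriteLine(KeysMatch("Ann", "Lee", "Ann", "Ray")); // False
        Console.WriteLine(KeysMatch("Ann", "Lee", "ann", "Lee")); // False
    }
}
```

Note that the string comparison is case-sensitive, so "Ann" and "ann" end up in different groups unless you normalise the values before grouping.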

And finally, I want to find all Submissions that have the same first name and last name OR the same email. Simply use Union(): it combines the two collections and removes any duplicates. (That’s the same Submission instance appearing in both results, not to be confused with the duplicates we’re trying to find.)

_filteredSubmissions = (from s in _filteredSubmissions
                        group s by new { s.FirstName, s.LastName }
                        into g
                        where g.Count() > 1
                        select g).SelectMany(g => g)
    .Union(
        (from s in _filteredSubmissions
         where !string.IsNullOrWhiteSpace(s.Email)
         group s by s.Email
         into g
         where g.Count() > 1
         select g).SelectMany(g => g)
    );


Note that I also added a check for blank emails: I don’t consider two Submissions duplicates just because neither has an email address. I didn’t do this for the FirstName or LastName fields because I know they are mandatory in my system.
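
To see the whole thing end to end, here is a self-contained sketch of the combined query. The Submission class, the DuplicateFinder helper and the sample data are all hypothetical (the post doesn't show the real ones), and it assumes Submission is a plain class that doesn't override Equals, so Union() compares instances by reference.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the post's Submission class.
public class Submission
{
    public string FirstName { get; set; }
    public string LastName { get; set; }
    public string Email { get; set; }
}

public static class DuplicateFinder
{
    // Submissions that share a first and last name OR a non-blank email with another one.
    public static List<Submission> FindDuplicates(IEnumerable<Submission> submissions)
    {
        var source = submissions.ToList();

        var sameName = (from s in source
                        group s by new { s.FirstName, s.LastName }
                        into g
                        where g.Count() > 1
                        select g).SelectMany(g => g);

        var sameEmail = (from s in source
                         where !string.IsNullOrWhiteSpace(s.Email)
                         group s by s.Email
                         into g
                         where g.Count() > 1
                         select g).SelectMany(g => g);

        // Union drops instances that appear in both results.
        return sameName.Union(sameEmail).ToList();
    }

    public static void Main()
    {
        // Made-up sample data for illustration.
        var submissions = new List<Submission>
        {
            new Submission { FirstName = "Ann", LastName = "Lee", Email = "ann@example.com" },
            new Submission { FirstName = "Ann", LastName = "Lee", Email = "ann@example.com" },
            new Submission { FirstName = "Bob", LastName = "Ray", Email = "" },
            new Submission { FirstName = "Cid", LastName = "Moe", Email = "" },
            new Submission { FirstName = "Dee", LastName = "Fox", Email = "ann@example.com" },
            new Submission { FirstName = "Eve", LastName = "Orr", Email = "eve@example.com" },
        };

        // Lists both Ann Lee entries and Dee Fox, each exactly once.
        foreach (var s in FindDuplicates(submissions))
            Console.WriteLine($"{s.FirstName} {s.LastName} <{s.Email}>");
    }
}
```

Bob and Cid both have blank emails but are not reported, and the two Ann Lee entries match on both name and email yet each shows up only once thanks to Union().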