Finding all the users that have duplicate names

I have users with first_name and last_name fields, and I need a Ruby find that returns all the users who have duplicate accounts based on first and last name. For example, I want a find that searches through all the other users and checks whether any have the same name and email. I was thinking of a nested loop like this:

User.all.each do |user|
  # maybe another loop to search through all the users and, if a match occurs, put that user in an array
end

Is there a better way?

Answers


You could go a long way toward narrowing down your search by finding out what the duplicated data is in the first place. For example, say you want to find each combination of first name and email that is used more than once.

User.find(:all, :group => [:first, :email], :having => "count(*) > 1" )

That returns an array containing one record from each set of duplicates. If, say, one of the returned users has "Fred" and "fred@example.com", you can then search for only the Users having those values to find all of the affected users.

The return from that find will be something like the following. Note that the array only contains a single record from each set of duplicated users.

[#<User id: 3, first: "foo", last: "barney", email: "foo@example.com", created_at: "2010-12-30 17:14:43", updated_at: "2010-12-30 17:14:43">, 
 #<User id: 5, first: "foo1", last: "baasdasdr", email: "abc@example.com", created_at: "2010-12-30 17:20:49", updated_at: "2010-12-30 17:20:49">]

For example, the first element in that array shows one user with "foo" and "foo@example.com". The rest of them can be pulled out of the database as needed with a find.

> User.find(:all, :conditions => {:email => "foo@example.com", :first => "foo"})
 => [#<User id: 1, first: "foo", last: "bar", email: "foo@example.com", created_at: "2010-12-30 17:14:28", updated_at: "2010-12-30 17:14:28">, 
     #<User id: 3, first: "foo", last: "barney", email: "foo@example.com", created_at: "2010-12-30 17:14:43", updated_at: "2010-12-30 17:14:43">]
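The group-then-filter logic behind that query can also be sketched in plain Ruby, without a database. This is only an illustration of what GROUP BY with HAVING count(*) > 1 computes: UserRow and the sample rows are made up (ids 1 and 3 mirror the output above, id 2 is invented), and the first-name field is called first_name here, as in the question.

```ruby
# Plain-Ruby sketch of GROUP BY first_name, email HAVING count(*) > 1.
# UserRow and the rows below are hypothetical stand-ins for users table records.
UserRow = Struct.new(:id, :first_name, :last_name, :email)

rows = [
  UserRow.new(1, "foo", "bar",    "foo@example.com"),
  UserRow.new(2, "baz", "qux",    "baz@example.com"),
  UserRow.new(3, "foo", "barney", "foo@example.com")
]

# Step 1: bucket rows by the columns that define a duplicate.
groups = rows.group_by { |u| [u.first_name, u.email] }

# Step 2: keep only the buckets that hold more than one row.
duplicates = groups.select { |_key, members| members.size > 1 }

duplicates.each do |(first_name, email), members|
  puts "#{first_name} / #{email}: ids #{members.map(&:id).join(', ')}"
end
# prints "foo / foo@example.com: ids 1, 3"
```

For a large table the database query is still the better choice, since this version loads every row into memory first.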

And it also seems like you'll want to add some better validation to your code to prevent duplicates in the future.

Edit:

If you need to use the big hammer of find_by_sql, because Rails 2.2 and earlier didn't support :having with find, the following should work and give you the same array that I described above.

User.find_by_sql("select * from users group by first,email having count(*) > 1")

After some googling, I ended up with this:

ActiveRecord::Base.connection.execute(<<-SQL).to_a
  SELECT 
    variants.id, variants.variant_no, variants.state 
  FROM variants INNER JOIN (
    SELECT 
      variant_no, state, COUNT(1) AS count 
    FROM variants
    GROUP BY 
      variant_no, state HAVING COUNT(1) > 1
  ) tt ON 
    variants.variant_no = tt.variant_no 
    AND variants.state IS NOT DISTINCT FROM tt.state;
SQL

Note the part that says IS NOT DISTINCT FROM: it is there to deal with NULL values, which can't be compared with an equals sign in Postgres.
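To make the NULL point concrete, here is a small Ruby sketch that imitates the two SQL comparisons, with nil standing in for NULL. Both helper methods are hypothetical and exist only for this illustration; they are not part of ActiveRecord or Postgres.

```ruby
# Emulates SQL comparison semantics in Ruby; nil stands in for NULL.
# Both helpers are hypothetical, written only for this illustration.

# SQL `a = b`: the result is NULL (unknown) when either side is NULL,
# so a join condition using it never matches rows with a NULL state.
def sql_equals(a, b)
  return nil if a.nil? || b.nil?
  a == b
end

# SQL `a IS NOT DISTINCT FROM b`: two NULLs count as the same value,
# and the result is always true or false, never NULL.
def sql_not_distinct(a, b)
  a == b
end

p sql_equals("active", "active")   # true
p sql_equals(nil, nil)             # nil
p sql_not_distinct(nil, nil)       # true
p sql_not_distinct(nil, "active")  # false
```

With a plain equals sign in the join, duplicated rows whose state is NULL would silently drop out of the result.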


If you are going the route of @hakunin and creating a query manually, you may wish to use the following:

ActiveRecord::Base.connection.exec_query(<<-SQL).to_a
  SELECT 
    variants.id, variants.variant_no, variants.state 
  FROM variants INNER JOIN (
    SELECT 
      variant_no, state, COUNT(1) AS count 
    FROM variants
    GROUP BY 
      variant_no, state HAVING COUNT(1) > 1
  ) tt ON 
    variants.variant_no = tt.variant_no 
    AND variants.state IS NOT DISTINCT FROM tt.state;
SQL

The change is replacing connection.execute(<<-SQL) with connection.exec_query(<<-SQL).

There can be a problem with memory leakage when using execute.

Please read "Clarify DataBaseStatements#execute" to get an in-depth understanding of the problem.

