Finding anagrams for a given word

Two words are anagrams if one of them has exactly same characters as that of the another word.

Example : Anagram & Nagaram are anagrams (case-insensitive).

Now there are many questions similar to this . A couple of approaches to find whether two strings are anagrams are :

1) Sort the strings and compare them.

2) Create a frequency map for these strings and check if they are the same or not.

But in this case , we are given with a word (for the sake of simplicity let us assume a single word only and it will have single word anagrams only) and we need to find anagrams for that.

Solution which I have in mind is that , we can generate all permutations for the word and check which of these words exist in the dictionary . But clearly , this is highly inefficient. Yes , the dictionary is available too.

So what alternatives do we have here ?

I also read in a similar thread that something can be done using Tries but the person didn't explain as to what the algorithm was and why did we use a Trie in first place , just an implementation was provided that too in Python or Ruby. So that wasn't really helpful which is why I have created this new thread. If someone wants to share their implementation (other than C,C++ or Java) then kindly explain it too.

Answers


Example algorithm:

Open dictionary
Create empty hashmap H
For each word in dictionary:
  Create a key that is the word's letters sorted alphabetically (and forced to one case)
  Add the word to the list of words accessed by the hash key in H

To check for all anagrams of a given word:

Create a key that is the letters of the word, sorted (and forced to one case)
Look up that key in H
You now have a list of all anagrams

Relatively fast to build, blazingly fast on look-up.


I came up with a new solution I guess. It uses the Fundamental Theorem of Arithmetic. So the idea is to use an array of the first 26 prime numbers. Then for each letter in the input word we get the corresponding prime number A = 2, B = 3, C = 5, D = 7 … and then we calculate the product of our input word. Next we do this for each word in the dictionary and if a word matches our input word, then we add it to the resulting list. All anagrams will have the same signature because

Any integer greater than 1 is either a prime number, or can be written as a unique product of prime numbers (ignoring the order).

Here's the code. I convert the word to UPPERCASE and 65 is the position of A which corresponds to my first prime number:

private int[] PRIMES = new int[] { 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31,
        37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101, 103,
        107, 109, 113 };

This is the method:

 private long calculateProduct(char[] letters) {
    long result = 1L;
    for (char c : letters) {
        if (c < 65) {
            return -1;
        }
        int pos = c - 65;
        result *= PRIMES[pos];
    }
    return result;
}

We know that if two words don't have the same length, they are not anagrams. So you can partition your dictionary in groups of words of the same length.

Now we focus on only one of these groups and basically all words have exactly the same length in this smaller universe.

If each letter position is a dimension, and the value in that dimension is based on the letter (say the ASCII code). Then you can calculate the length of the word vector.

For example, say 'A'=65, 'B'=66, then length("AB") = sqrt(65*65 + 66*66). Obviously, length("AB") = length("BA").

Clearly, if two word are anagrams, then their vectors have the same length. The next question is, if two word (of same number of letters) vectors have the same length, are they anagrams? Intuitively, I'd say no, since all vectors with that length forms a sphere, there are many. Not sure, since we're in the integer space in this case, how many there are actually.

But at the very least it allows you to partition your dictionary even further. For each word in your dictionary, calculate the vector's distance: for(each letter c) { distance += c*c }; distance = sqrt(distance);

Then create a map for all words of length n, and key it with the distance and the value is a list of words of length n that yield that particular distance.

You'll create a map for each distance.

Then your lookup becomes the following algorithm:

  1. Use the correct dictionary map based on the length of the word
  2. Compute the length of your word's vector
  3. Lookup the list of words that match that length
  4. Go through the list and pick the anagrams using a naive algorithm is now the list of candidates is greatly reduced

Well Tries would make it easier to check if the word exists. So if you put the whole dictionary in a trie:

http://en.wikipedia.org/wiki/Trie

then you can afterward take your word and do simple backtracking by taking a char and recursively checking if we can "walk" down the Trie with any combiniation of the rest of the chars (adding one char at a time). When all chars are used in a recursion branch and there was a valid path in the Trie, then the word exists.

The Trie helps because its a nice stopping condition: We can check if the part of a string, e.g "Anag" is a valid path in the trie, if not we can break that perticular recursion branch. This means we don't have to check every single permutation of the characters.

In pseudo-code

checkAllChars(currentPositionInTrie, currentlyUsedChars, restOfWord)
    if (restOfWord == 0)
    {
         AddWord(currentlyUsedChar)
    }
    else 
    {
        foreach (char in restOfWord)
        {
            nextPositionInTrie = Trie.Walk(currentPositionInTrie, char)
            if (nextPositionInTrie != Positions.NOT_POSSIBLE)
            {
                checkAllChars(nextPositionInTrie, currentlyUsedChars.With(char), restOfWord.Without(char))
            }
        }   
    }

Obviously you need a nice Trie datastructure which allows you to progressively "walk" down the tree and check at each node if there is a path with the given char to any next node...


static void Main(string[] args)
{

    string str1 = "Tom Marvolo Riddle";
    string str2 = "I am Lord Voldemort";

    str2=  str2.Replace(" ", string.Empty);
    str1 = str1.Replace(" ", string.Empty);
    if (str1.Length != str2.Length)
        Console.WriteLine("Strings are not anagram");
    else
    {
        str1 = str1.ToUpper();
        str2 = str2.ToUpper();
        int countStr1 = 0;
        int countStr2 = 0;
        for (int i = 0; i < str1.Length; i++)
        {
            countStr1 += str1[i];
            countStr2 += str2[i];

        }
        if(countStr2!=countStr1)
            Console.WriteLine("Strings are not anagram");
        else Console.WriteLine("Strings are  anagram");

    }
    Console.Read();
}

  • Reduce the words to - say - lower case (clojure.string/lower-case).
  • Classify them (group-by) by letter frequency-map (frequencies).
  • Drop the frequency maps,
  • ... leaving the collections of anagrams.

(These) are the corresponding functions in the Lisp dialect Clojure.

The whole function can be expressed so:

(defn anagrams [dict]
  (->> dict
       (map clojure.string/lower-case)
       (group-by frequencies)
       vals))

For example,

(anagrams ["Salt" "last" "one" "eon" "plod"])
;(["salt" "last"] ["one" "eon"] ["plod"])

An indexing function that maps each thing to its collection is

(defn index [xss]
  (into {} (for [xs xss, x xs] [x xs])))

So that, for example,

((comp index anagrams) ["Salt" "last" "one" "eon" "plod"])
;{"salt" ["salt" "last"], "last" ["salt" "last"], "one" ["one" "eon"], "eon" ["one" "eon"], "plod" ["plod"]}

... where comp is the functional composition operator.


Generating all permutations is easy, I guess you are worried that checking their existence in the dictionary is the "highly inefficient" part. But that actually depends on what data structure you use for the dictionary: of course, a list of words would be inefficient for your use case. Speaking of Tries, they would probably be an ideal representation, and quite efficient, too.

Another possibility would be to do some pre-processing on your dictionary, e.g. build a hashtable where the keys are the word's letters sorted, and the values are lists of words. You can even serialize this hashtable so you can write it to a file and reload quickly later. Then to look up anagrams, you simply sort your given word and look up the corresponding entry in the hashtable.


That depends on how you store your dictionary. If it is a simple array of words, no algorithm will be faster than linear.

If it is sorted, then here's an approach that may work. I've invented it just now, but I guess its faster than linear approach.

  1. Denote your dictionary as D, current prefix as S. S = 0;
  2. You create frequency map for your word. Lets denote it by F.
  3. Using binary search find pointers to start of each letter in dictionary. Lets denote this array of pointers by P.
  4. For each char c from A to Z, if F[c] == 0, skip it, else
    • S += c;
    • F[c] --;
    • P <- for every character i P[i] = pointer to first word beginning with S+i.
    • Recursively call step 4 till you find a match for your word or till you find that no such match exists.

This is how I would do it, anyway. There should be a more conventional approach, but this is faster then linear.


tried to implement the hashmap solution

public class Dictionary {

    public static void main(String[] args){

    String[] Dictionary=new String[]{"dog","god","tool","loot","rose","sore"};

    HashMap<String,String> h=new HashMap<String, String>();

    QuickSort q=new QuickSort();

    for(int i=0;i<Dictionary.length;i++){

        String temp =new String();

        temp= q.quickSort(Dictionary[i]);//sorted word e.g dgo for dog

        if(!h.containsKey(temp)){
           h.put(temp,Dictionary[i]);
        }

        else
        {
           String s=h.get(temp);
           h.put(temp,s + " , "+ Dictionary[i]);
        }
    }

    String word=new String(){"tolo"};

    String sortedword = q.quickSort(word);

    if(h.containsKey(sortedword.toLowerCase())){ //used lowercase to make the words case sensitive

        System.out.println("anagrams from Dictionary   :  " + h.get(sortedword.toLowerCase()));
    }

}

  • Compute the frequency count vector for each word in the dictionary, a vector of length of the alphabet list.
  • generate a random Gaussian vector of the length of the alphabet list
  • project each dictionary word's count vector in this random direction and store the value (insert such that the array of values is sorted).

  • Given a new test word, project it in the same random direction used for the dictionary words.

  • Do a binary search to find the list of words that map to the same value.
  • Verify if each word obtained as above is indeed a true anagram. If not, remove it from the list.
  • Return the remaining elements of the list.

PS: The above procedure is a generalization of the prime number procedure which may potentially lead to large numbers (and hence computational precision issues)


One solution is - Map prime numbers to alphabet characters and multiply prime number

For ex - 

    a -> 2
    b -> 3
    ......
    .......
    ......
    z -> 101

So

'ab' -> 6
'ba' -> 6
'bab' -> 18
'abba' -> 36
'baba' -> 36

Get MUL_number for Given word. return all the words from dictionary which have same MUL_number as given word


First check if the length of the strings are the same. then check if the sum of the characters in both the strings are same (ie the ascii code sum) then the words are anagrams else not an anagram


Need Your Help

Two floated columns - one fixed, one loose width

html css css-float

I've looked around SO, but I cannot find one that matches my occurrence, I basically have two columns one fixed width (185px) and the other column has no fixed width, however I need the last column...

Error adding geofences in Android (status code 1000)

android geolocation google-play-services android-location android-geofence

I am getting an error in the onAddGeofencesResult(int statusCode, String[] geofenceRequestIds) callback with statusCode = 1000.