Remove repeating characters from words

What is the best way to convert something like "haaaaapppppyyy" to "haappyy"?

I'm parsing slang, where people sometimes repeat characters for added emphasis.

Using set() doesn't work because the order of the letters matters.

Any ideas? I'm using Python + nltk.

Answers


It can be done using regular expressions:

>>> import re
>>> re.sub(r'(.)\1+', r'\1\1', "haaaaapppppyyy")
'haappyy'

(.)\1+ matches any character (.) followed by one or more copies of the same character (the backreference \1 forces it to be the same one), and the replacement \1\1 substitutes two copies of that character.
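If the cap on repetitions should be adjustable, the same pattern can be wrapped in a small helper. This is only a sketch: limit_repeats and its max_repeat parameter are names made up for illustration, not part of re or nltk.

```python
import re

def limit_repeats(text, max_repeat=2):
    """Collapse each run of one character to at most max_repeat copies.

    limit_repeats and max_repeat are hypothetical names for this sketch.
    """
    # (.) captures a character; \1{n,} matches n or more further copies of it,
    # so the whole pattern only matches runs longer than max_repeat.
    pattern = r'(.)\1{%d,}' % max_repeat
    return re.sub(pattern, lambda m: m.group(1) * max_repeat, text)

print(limit_repeats("haaaaapppppyyy"))     # haappyy
print(limit_repeats("haaaaapppppyyy", 1))  # hapy
```

Runs that are already short enough are left untouched, so ordinary words such as "happy" pass through unchanged with the default setting.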


You can squash multiple occurrences of letters with itertools.groupby:

>>> from itertools import groupby
>>> ''.join(c for c, _ in groupby("haaaaapppppyyy"))
'hapy'

Similarly, you can get haappyy from groupby with

>>> ''.join(''.join(s)[:2] for _, s in groupby("haaaaapppppyyy"))
'haappyy'
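Both groupby variants can be folded into one function by taking the first few items of each run with itertools.islice instead of slicing a joined string. A minimal sketch, assuming a made-up helper name squash_runs and a made-up max_repeat parameter:

```python
from itertools import groupby, islice

def squash_runs(text, max_repeat=2):
    """Keep at most max_repeat characters from each run of equal characters.

    squash_runs and max_repeat are hypothetical names for this sketch.
    """
    # groupby yields (character, iterator over that run); islice takes the
    # first max_repeat items of the run without materialising the whole run.
    return ''.join(
        ''.join(islice(run, max_repeat)) for _, run in groupby(text)
    )

print(squash_runs("haaaaapppppyyy"))     # haappyy
print(squash_runs("haaaaapppppyyy", 1))  # hapy
```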

You can also do it without reduce or regexps:

>>> s = 'haaaaapppppyyy'
>>> # Drop a character only when the two characters before it are the same letter.
>>> ''.join(['' if i>1 and e==s[i-1]==s[i-2] else e for i,e in enumerate(s)])
'haappyy'

The number of repetitions is hardcoded as >1 and -2 above. The general case:

>>> reps = 1
>>> ''.join(['' if i>=reps and s[i-reps:i]==e*reps else e for i,e in enumerate(s)])
'hapy'

This is one way of doing it (subject to the obvious limitation that Python doesn't speak English).

>>> from functools import reduce  # reduce is not a builtin in Python 3
>>> s="haaaappppyy"
>>> reduce(lambda x,y: x+y if x[-2:]!=y*2 else x, s, "")
'haappyy'
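The same fold can be written as an explicit loop, which avoids reduce altogether and collects the result in a list rather than by repeated string concatenation. A sketch, with cap_doubles as a made-up name:

```python
def cap_doubles(text):
    """Append each character unless the last two kept characters already equal it.

    cap_doubles is a hypothetical name for this sketch.
    """
    kept = []
    for ch in text:
        # Same test as x[-2:] != y*2 in the reduce version.
        if kept[-2:] != [ch, ch]:
            kept.append(ch)
    return ''.join(kept)

print(cap_doubles("haaaappppyy"))  # haappyy
```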

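Regarding set(): you can use collections.OrderedDict to remove duplicates while keeping the letters in order. Note that this keeps only the first occurrence of each letter, so it also drops repeats that are not adjacent. So use:

from collections import OrderedDict

text = "happy"
print(list(OrderedDict.fromkeys(text)))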

which will give you:

['h', 'a', 'p', 'y']
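Joining the keys back into a string makes the result easier to compare with the other answers. A small sketch (unique_letters is a made-up name); the "banana" line shows what happens when the same letter reappears later in the word:

```python
from collections import OrderedDict

def unique_letters(text):
    """Return text with every letter kept only at its first occurrence.

    unique_letters is a hypothetical name for this sketch.
    """
    return ''.join(OrderedDict.fromkeys(text))

print(unique_letters("happy"))           # hapy
print(unique_letters("haaaaapppppyyy"))  # hapy
print(unique_letters("banana"))          # ban
```

For the original problem, where only consecutive repeats should be trimmed, this is probably too aggressive.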
