Regular expressions

Regular expressions are an immensely powerful tool built into most modern programming languages. They are a type of formal grammar that lets you match strings that conform (or fail to conform) to a particular rule. Common uses (see http://www.rexegg.com/regex-uses.html) include checking whether user input conforms to a desired pattern (e.g., 3 numbers, followed by 2 numbers, followed by 3 numbers), performing all sorts of complicated search-and-replace operations in text files, and batch-renaming files.

There are entire books (e.g., https://www.amazon.com/Mastering-Regular-Expressions-Jeffrey-Friedl/dp/0596528124) written on regular expressions, as well as comprehensive online references. We'll only concern ourselves with a few basics here.

Start by looking over the first couple of lessons of this tutorial, paying special attention to the sidebar on the right, which I reproduce below.

Syntax Meaning
abc… Literal letters
\d Any Digit
\D Any Non-digit character
. Any Character
\. Period (backslash is the escape character)
[abc] Only a, b, or c
[^abc] Not a, b, nor c
[a-z] Characters a to z
[0-9] Numbers 0 to 9
\w Any Alphanumeric character
\W Any Non-alphanumeric character
{m} m Repetitions
{m,n} m to n Repetitions
* Zero or more repetitions
+ One or more repetitions
? Optional character
\s Any Whitespace
\S Any Non-whitespace character
^…$ Starts and ends
(…) Capture Group
(a(bc)) Capture Sub-group
(.*) Capture all
(abc|def) Matches abc or def
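To make the table concrete, here is a quick sanity check of a few of these patterns in Python (a minimal sketch; the example strings are made up):

```python
import re

# \d+ matches runs of digits
print(re.findall(r'\d+', 'room 101, floor 3'))  # ['101', '3']

# a character class like [aeiou] matches any one of the listed characters
print(re.findall(r'[aeiou]', 'regex'))  # ['e', 'e']

# \. matches a literal period (the backslash escapes it)
print(re.findall(r'\.', 'v1.2.3'))  # ['.', '.']

# the | alternation matches either alternative
print(re.findall(r'cat|dog', 'cat dog cow'))  # ['cat', 'dog']
```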

Use as a filter

Let's begin by reading in a file containing a bunch of words from the American National Corpus that have a frequency of at least 9. Here's a sample of what this file looks like.

word    lemma    pos    freq
the    the    DT    1081168
of    of    IN    539793
and    and    CC    466737
to    to    TO    448519
a    a    DT    406057
in    in    IN    360853
is    be    VBZ    192975

For those unfamiliar with language lingo, English lemmas are basically the word-stems, e.g., the lemma of cars is car; the lemma of walking is walk. pos stands for part of speech.

Just to illustrate an alternative way of dealing with csv files, let's use the csv module:

import csv

csvfile = open('datasets/ANC-written-count_over9.txt', 'r')
data = csv.DictReader(csvfile, delimiter='\t')

data is a DictReader object, which is kind of like the list of dictionaries we made when reading in trial files, except that unlike a list, we can't access a particular element; we have to iterate through it. The advantage is that we don't need to hold the entire dataset in memory at the same time (important for very large datasets, though not an issue in the present case).
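To see the iteration in action without the corpus file, here is a self-contained sketch that feeds DictReader an in-memory string (using Python 3's io.StringIO; the sample rows mirror the table above):

```python
import csv
import io

# simulate the tab-delimited corpus file with two of the sample rows shown above
sample = "word\tlemma\tpos\tfreq\nthe\tthe\tDT\t1081168\nof\tof\tIN\t539793\n"
reader = csv.DictReader(io.StringIO(sample), delimiter='\t')

# each row comes back as a dictionary keyed by the header line
rows = [row for row in reader]
print(rows[0]['word'], rows[0]['freq'])  # the 1081168
```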

Since memory is not an issue here, let's put all the unique words into a set. We don't care about frequencies for this exercise.

import re #import the python regexp module

words = set([row['word'] for row in data])
print "We have", len(words), "words"
We have 48318 words

Now let's use some regular expressions, starting with simple ones and moving on to ever so slightly more complicated ones.

Grab words beginning with q

[curWord for curWord in words if re.findall('^q',curWord)]
Out[155]:
['quickedit',
 'qualifications',
 'queue',
 'quadrant',
 'quarter',
 'quartet',
 'qol',
 'queues',
 'qida',
 'quicker',
 'quixtar',
 'quadrupled',
 'quaint',
 'quarterfinal',
 'quixote',
 'quarry',
 'quickened',
 'quintiles',
 'quarries',
 'qur',
 'qua',
 'quivering',
 'quenched',
 'quentin',
 'quilts',
 'qualifying',
 'queensland',
 'quartets',
 'quantitation',
 'questioner',
 'questioned',
 'quartiles',
 'quarrels',
 'quibble',
 'quinones',
 'qaida',
 'quine',
 'quick-edit',
 'quoting',
 'qrna',
 'queens',
 'quarterbacks',
 'questioning',
 'qiagen',
 'quiche',
 'qiaquick',
 'qrs',
 'qus',
 'que',
 'qui',
 'quo',
 'quadrangle',
 'quintet',
 'qingdao',
 'qualifies',
 'qualifier',
 'qualified',
 'quiet',
 'quantiles',
 'qtls',
 'qualcomm',
 'quetzalcoatl',
 'questions',
 'quips',
 'quso',
 'qualifiers',
 'questionnaires',
 'quiver',
 'quark',
 'quart',
 'qassam',
 'quickness',
 'quote',
 'quota',
 'qadir',
 'quarters',
 'qatar',
 'q\xe9\xe5\xfc\xe5\xe7\xe4\xe7\xf6\xf3',
 'qios',
 'quantified',
 'quarter-century',
 'qaddafi',
 'quantities',
 'quantitate',
 'question',
 'qutb',
 'quotient',
 'quarterfinals',
 'quintanilla',
 'query',
 'quakes',
 'quaker',
 'quarrel',
 'quitting',
 'quds',
 'quaid',
 'quail',
 'quenching',
 'quantification',
 'qp',
 'qu',
 'qa',
 'qb',
 'qd',
 'quadruple',
 'quashed',
 'quantitative',
 'quirk',
 'quickie',
 'quiescent',
 'queasy',
 'qpak',
 'quixotic',
 'quantitated',
 'quagmire',
 'qualify',
 'quizzes',
 'qu\xe9bec',
 'quests',
 'quartops',
 'q\xe7\xed',
 'quecreek',
 'quantitatively',
 'quantile',
 'quieter',
 'quantifying',
 'quintessential',
 'qrt-pcr',
 'quay',
 'quai',
 'quartz',
 'quality-of-life',
 'qspline',
 'quayle',
 'quell',
 'quotations',
 'qualms',
 'quit',
 'quip',
 'quiz',
 'quid',
 'quotation',
 'quibbles',
 'quindlen',
 'quantify',
 'quench',
 'quadrants',
 'quran',
 'quintana',
 'queen',
 'quad',
 'qwest',
 'quinta',
 'quickly',
 'questionable',
 'quasi',
 'quash',
 'quigley',
 'quarreling',
 'quartop',
 'qing',
 'quake',
 'queried',
 'queries',
 'quadratic',
 'qaeda',
 'quays',
 'qio',
 'qin',
 'quandary',
 'quite',
 'quits',
 'quack',
 'quest',
 'quarterback',
 'quayside',
 'qtc',
 'qtl',
 'quickest',
 'qe',
 'quinn',
 'quick-line',
 'q-pna',
 'quoted',
 'quotes',
 'qsp',
 'quinean',
 'questionnaire',
 'quotas',
 'quantum',
 'querying',
 'qureshi',
 'quality',
 'quartile',
 'quenk',
 'quilt',
 'qualitatively',
 'quill',
 'quot',
 'qu\xe9b\xe9cois',
 'qualities',
 'quasispecies',
 'quintile',
 'q',
 'quantifiable',
 'quackery',
 'quinone',
 'quietly',
 'quickening',
 'quebec',
 'qianlong',
 'quincy',
 'quarterly',
 'qualification',
 'quilted',
 'qualitative',
 'quilting',
 'quotable',
 'queer',
 'quick',
 'qt',
 'quirks',
 'question-and-answer',
 'quirky',
 'quantity',
 'quintas',
 'quipped',
 'qeis',
 'quidditch',
 'quaternary']

Grab all words that begin with an a and end with an i

[curWord for curWord in words if re.findall('^a\w+i$',curWord)]
Out[156]:
['agassi',
 'asci',
 'adlai',
 'antoni',
 'assisi',
 'armani',
 'andrei',
 'anti',
 'audi',
 'alexei',
 'ascii',
 'alibi',
 'api',
 'ajami',
 'aulaqi',
 'agnelli',
 'abdelghani',
 'acini',
 'ami',
 'afi',
 'alumni',
 'alveoli',
 'arabi',
 'agouti',
 'arundhati',
 'asahi',
 'aci',
 'ashkenazi',
 'adi',
 'amalfi',
 'ali',
 'accompli',
 'ambani',
 'afghani',
 'abi',
 'avi',
 'ani',
 'ari',
 'abyssi']

Grab all words that begin with an a, followed by 4-6 letters, and end on an i

[curWord for curWord in words if re.findall('^a\w{4,6}i$',curWord)]
Out[157]:
['agassi',
 'antoni',
 'assisi',
 'armani',
 'andrei',
 'alexei',
 'aulaqi',
 'agnelli',
 'alumni',
 'alveoli',
 'agouti',
 'amalfi',
 'accompli',
 'ambani',
 'afghani',
 'abyssi']

Grab words that start with a b, end on a t, and contain a t somewhere in the middle

[curWord for curWord in words if re.findall('^b\w+t\w+t$',curWord)]
Out[158]:
['blastocyst',
 'brightest',
 'bioterrorist',
 'bittorrent',
 'blatant',
 'backstreet',
 'butaprost',
 'baptist',
 'bestest',
 'bartlett',
 'bittersweet',
 'betterment',
 'butait',
 'batshit']

Let's say we want to exclude words that end on two ts. (Note that a character class matches a single character, so [^tt] is equivalent to [^t].)

[curWord for curWord in words if re.findall('^b\w+t\w+[^tt]t$',curWord)]
Out[159]:
['blastocyst',
 'brightest',
 'bioterrorist',
 'bittorrent',
 'blatant',
 'backstreet',
 'butaprost',
 'baptist',
 'bestest',
 'bittersweet',
 'betterment',
 'butait',
 'batshit']
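Since [^tt] and [^t] are the same class, the simpler form works identically. Here is a quick check on a toy word list (the three words are picked from the output above for illustration):

```python
import re

words = ['bartlett', 'batshit', 'brightest']  # a toy list for illustration

# [^t] excludes any word whose final t is preceded by another t
no_double_t = [w for w in words if re.findall(r'^b\w+t\w+[^t]t$', w)]
print(no_double_t)  # 'bartlett' (which ends in two ts) is excluded
```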

Let's get all the words containing the vowels a, e, i, o, in that order

[curWord for curWord in words if re.findall('\w+a+\w+e+\w+i+\w+o+',curWord)]
Out[160]:
['intraperitoneally',
 'characterizations',
 'catheterization',
 'characterization',
 'cardiorespiratory',
 'categorization',
 'carvedilol',
 'compartmentalization',
 'intraperitoneal',
 'campesino',
 'parameterization',
 'chloramphenicol']

You know the saying: i before e, except after c (in which case it's e before i, like receive). Let's see how well this mnemonic holds up.

Let's find out how many words there are that have ie vs. ei in them.

print "ie words:", len([curWord for curWord in words if re.findall('ie',curWord)])
print "ei words:", len([curWord for curWord in words if re.findall('ei',curWord)])
ie words: 1439
ei words: 483

Now let's check what happens when we check for a 'c' preceding ie/ei

print "cie words:", len([curWord for curWord in words if re.findall('cie',curWord)])
print "cei words:", len([curWord for curWord in words if re.findall('cei',curWord)])
cie words: 107
cei words: 33

There are actually more words that violate the mnemonic than those that obey it! What are these words?

[curWord for curWord in words if re.findall('cie',curWord)]
Out[163]:
['contingencies',
 'democracies',
 'species-specific',
 'societal',
 'consciences',
 'insufficiently',
 'tumefaciens',
 'deficiency',
 'sucient',
 'dependencies',
 'francie',
 'marcie',
 'self-sufficient',
 'gracie',
 'vacancies',
 'inefficiencies',
 'inefficient',
 'inadequacies',
 'omniscient',
 'currencies',
 'proficient',
 'deficiencies',
 'frequencies',
 'sciences',
 'potencies',
 'subspecies',
 'concierge',
 'scientia',
 'unscientific',
 'science-fiction',
 'delicacies',
 'science',
 'newscientist',
 'efficiencies',
 'efficiently',
 'malignancies',
 'conscience',
 'redundancies',
 'conscientious',
 'prophecies',
 'bureaucracies',
 'pregnancies',
 'ancients',
 'constituencies',
 'prescience',
 'insufficiency',
 'discrepancies',
 'prescient',
 'intricacies',
 'sufficient',
 'interspecies',
 'societies',
 'glacier',
 'policies',
 'scientists',
 'lucie',
 'glaciers',
 'scientology',
 'tendencies',
 'deficient',
 'species',
 'btk-deficient',
 'coefficient',
 'fancies',
 'fancied',
 'coefficients',
 'efficient',
 'energy-efficient',
 'ancient',
 'neuroscientist',
 'aberrancies',
 'conspiracies',
 'self-sufficiency',
 'agencies',
 'financiers',
 'conscientiously',
 'financier',
 'scientifically',
 'bankruptcies',
 'scientific',
 'society',
 'proficiency',
 'biosciences',
 'sufficiently',
 'insufficient',
 'pricier',
 'candidacies',
 'emergencies',
 'inefficiency',
 'saucier',
 'fancier',
 'sufficiency',
 'efficiency',
 'suycient',
 'neuroscience',
 'quasispecies',
 'pharmacies',
 'science-based',
 'hacienda',
 'legacies',
 'inaccuracies',
 'efficacies',
 'immunodeficiency',
 'scientist',
 'cross-species',
 'competencies',
 'inconsistencies']

Here's a tricky one. Let's find words containing 4 rs (interspersed among other letters). One way to do this is to spell it out explicitly: any characters, an r, any characters, an r, and so on. Like so:

[curWord for curWord in words if re.findall('\w*r+\w*r+\w*r+\w*r+\w*',curWord)]
Out[164]:
['extraterrestrials',
 'counterterrorism',
 'counterterrorist',
 'extracurricular',
 'extraterrestrial',
 'refrigerators',
 'grrrr',
 'cardiorespiratory',
 'refrigerator']

There are two shortcomings to this approach. The first is that if we want 3 or 5 matches, we need to explicitly remove or add code rather than changing a single number-of-matches parameter. Another shortcoming is that hyphenated words are excluded. We can add hyphens by replacing \w with [a-z\-], but that makes the expression even longer. Here's a better solution:

[curWord for curWord in words if re.findall('([^r]*r[^r]*){4}$',curWord)]
Out[165]:
['extraterrestrials',
 'counterterrorism',
 'writer-director',
 'reverse-transcribed',
 'counterterrorist',
 'extracurricular',
 'extraterrestrial',
 'refrigerators',
 'grrrr',
 'cardiorespiratory',
 'corporate-reform',
 'counter-terrorism',
 'antiretroviral-experienced',
 'reverse-transcription',
 'refrigerator']

Let's unpack that. We are matching a group, demarcated by parentheses. The group pattern is: not-an-r (0 or more times), an r, then not-an-r (0 or more times). We want words that match this pattern exactly 4 times in a row, ending at the end of the word. That gives us all the words containing (at least) four rs with anything in between them (including nothing, hence grrrr).
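The payoff of this formulation is that the repetition count becomes a single parameter. Here is a small sketch wrapping it in a hypothetical helper function, tried on a made-up word list:

```python
import re

def at_least_n_rs(words, n):
    # words containing at least n rs: the one-r group, repeated n times,
    # anchored to the end of the word
    pattern = re.compile('([^r]*r[^r]*){%d}$' % n)
    return [w for w in words if pattern.findall(w)]

sample = ['grrrr', 'refrigerator', 'river', 'roar']
print(at_least_n_rs(sample, 4))  # ['grrrr', 'refrigerator']
print(at_least_n_rs(sample, 2))  # all four words
```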

Use in place of conditionals

Let's say we want to check whether an entered word is color or the British colour. We could do this with a conditional (if the word is "color" or "colour"), but we can also use regular expressions (which scale much better than conditionals). For example:

re.findall('colou?r','The British like to colour their colors')
Out[166]:
['colour', 'color']
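The same optional-character pattern works for search-and-replace with re.sub; a minimal sketch that normalizes both spellings to the American one:

```python
import re

text = 'The British like to colour their colors'
# colou?r matches both spellings; replace each match with 'color'
print(re.sub(r'colou?r', 'color', text))
```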

Play around with this:

variousWords = re.compile('[dc][ao][gt]') #inside [], | is literal, so write [dc] rather than [d|c]
variousWords.match('cog').group()
variousWords.match('cag').group()
Out[167]:
'cag'

Here are some more examples.

#will match any numbers
anyNums = re.compile('[0-9]+')
anyNums.findall('There are 99 bottles of beer on the wall. 999....') #will return all matches
anyNums.search('There are 99 bottles of beer on the wall. 999....').group() #will return just the first occurrence

#two digit numbers from 00 to 59 or 80 to 89 
someNums = re.compile('[0-5][0-9]|[8][0-9]')
matches = [someNums.search(x).group() for x in 'It will match 54, 52, and 88, but not 7 or 92 or any of the letters'.split(' ') if someNums.search(x)]
matches

#We don't need to compile regular expressions using re.compile, but it speeds things up when applying the same rule over a large corpus.

emailRegExGrouped = re.compile('([\w.-]+)@([\w.-]+)')
#the parentheses allow us to access groups -- the first group corresponds to the first matched part (before the @); the second group to the domain (e.g., wisc.edu)

emailRegExGrouped.search('g.lupyan@gmail.com').groups()
('g.lupyan', 'gmail.com')

emailRegExGrouped.findall('g.lupyan@gmail.com lupyan@wisc.edu')
#returns [('g.lupyan', 'gmail.com'), ('lupyan', 'wisc.edu')]

#to get all the domains:
[email[1] for email in emailRegExGrouped.findall('g.lupyan@gmail.com lupyan@wisc.edu')]
#returns ['gmail.com', 'wisc.edu']
Out[168]:
['gmail.com', 'wisc.edu']
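For readability, groups can also be named with the (?P&lt;name&gt;...) syntax; a sketch of the same email pattern (the names user and domain are my own choices):

```python
import re

# same pattern as above, but with named groups
emailRegExNamed = re.compile(r'(?P<user>[\w.-]+)@(?P<domain>[\w.-]+)')

m = emailRegExNamed.search('g.lupyan@gmail.com')
print(m.group('user'), m.group('domain'))  # g.lupyan gmail.com
```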

Search and replace

All good text editors allow you to use regular expressions in search and replace. A simple usage case is searching for lines that begin or end with a certain character sequence. To find lines that begin with "ab", search for ^ab. To find lines that end on ies search for ies$

Make sure to enable regular-expression search by clicking on .* in Sublime Text or checking the appropriate box (sometimes labeled Grep) in other text editors.

When using regular expressions in search/replace, it becomes useful to use matching groups.

For example, suppose you want to replace the occurrences of the following strings, which occur at the start of each line:

bdSubjCode_130
badSbjCode_131
baSubjCode_132
badSubjCode_133
badubjCode_134
BadSubjCode_135

with

MYSUBJCODE_130
MYSUBJCODE_131
MYSUBJCODE_132
MYSUBJCODE_133
MYSUBJCODE_134
MYSUBJCODE_135

You could manually do search replaces for each one. But if you have a hundred of these, that gets tedious fast and is a recipe for errors.

Here's a much better solution. Simply search for:

(^\w+_)([0-9]+)

and replace with

MYSUBJCODE_\2

The \2 refers to the second group, i.e., the number.
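The same group-based substitution can be done in Python with re.sub, where \2 in the replacement string refers back to the second group:

```python
import re

lines = ['bdSubjCode_130', 'badSbjCode_131', 'BadSubjCode_135']
# keep the captured number (\2), replace everything before it
fixed = [re.sub(r'(^\w+_)([0-9]+)', r'MYSUBJCODE_\2', line) for line in lines]
print(fixed)  # ['MYSUBJCODE_130', 'MYSUBJCODE_131', 'MYSUBJCODE_135']
```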

Here's another example. Empty out all lines that begin with letters ending in 'ing':

Search: (^\w+ing).* Replace with: nothing

Now you can do another search and replace, searching for \n+ and replacing with \n, to get rid of the multiple newlines that the first search/replace may have created.
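In Python, that newline cleanup is a one-liner with re.sub (a sketch on a made-up string):

```python
import re

text = 'keep this\n\n\n\nand this\n'
# collapse any run of newlines down to a single newline
print(repr(re.sub(r'\n+', '\n', text)))  # 'keep this\nand this\n'
```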

Renaming files

Use the same logic to rename files. You can do this in Python or with GUI programs like NameChanger (Mac) or Bulk Rename (PC). These programs let you batch-rename files using simple search/replace (e.g., replace _ with -) or search/replace augmented by regular expressions.
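Doing it in Python amounts to combining os.listdir, re.sub, and os.rename. Here is a minimal sketch (the function name and the dry_run flag are my own; by default it only prints what it would do, which is a good habit before renaming for real):

```python
import os
import re

def rename_files(directory, pattern, replacement, dry_run=True):
    # apply a regex search/replace to every filename in directory
    for name in os.listdir(directory):
        new_name = re.sub(pattern, replacement, name)
        if new_name != name:
            if dry_run:
                print(name, '->', new_name)
            else:
                os.rename(os.path.join(directory, name),
                          os.path.join(directory, new_name))
```

For example, rename_files('.', r'^\w+Code_', 'MYSUBJCODE_') would preview the subject-code cleanup from the previous section applied to filenames.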