From Lupyan_Lab_Wiki


Fun with strings

Inside the Commons\exercise_9 folder is a corpus of TV-show subtitles from a number of US and UK shows. Please write code to answer the following questions.

  1. How many unique words are there in the US corpus? In the UK corpus? Hint: you probably want to concatenate all the files from each corpus into a single file, so that you're working with two files rather than hundreds. See here to refresh your command line skills.
  2. List all the words that are the same word but have a longer spelling in the UK than in the US corpus (e.g., color/colour, behavior/behaviour, acknowledgment/acknowledgement). Now list all the words that have a longer spelling in US than in UK English.
  3. What is the average word length in the US corpus? In the UK corpus?
  4. What are the 10 words that are relatively more frequent in the UK compared to the US texts? What are the 10 words that are relatively more frequent in the US compared to the UK texts?
    1. The definition of relative frequency is: frequency(Word)/number_of_total_words_in_the_corpus. (An earlier version said number of unique words; that was wrong.) To get the words that are relatively more frequent in the US than in the UK corpus, subtract the UK relative frequency from the US relative frequency for each word and output the 10 words with the largest differences. Ignore any words that occur in only one of the corpora (e.g., 'colouring' occurs in the UK but not the US corpus, so it wouldn't make sense to compare its relative frequency).
  5. Now redo the analysis above, but include only reasonably common words: those that occur with a frequency of >=40 in both the US and the UK corpus.
  6. Bonus: Redo the analysis above for types instead of tokens, i.e., grouping inflected forms under their base form. This means that go/went/going would all count toward the frequency of 'go', and cat/cats/cat's would all count toward the frequency of 'cat'. You can use the stemming or lemmatizing functions from the NLTK library OR get the token-to-type conversion from the ANC-written-count_over9.txt file inside common\demos\regex. If you use the ANC file to do the token-to-type conversion, ignore words that occur in the subtitles files but not in the ANC file.

All of these questions have objectively correct answers, but there are a variety of ways to get to them. You are free to use simple string matching, regular expressions, the NLTK library, and any standard Python libraries you think would be useful. For example, the Counter class from the collections module may come in handy, as may the difflib module I mentioned in class. NLTK has functions that will allow you to do this exercise using just a few lines of code.
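
To make the Counter and difflib suggestions concrete, here is a small sketch covering questions 1 to 3. It is only one way to approach them. The helper names (vocabulary_stats, longer_spellings), the similarity cutoff of 0.85, and the choice to average word length over tokens rather than over unique words are all assumptions made for this example. The spelling comparison is a heuristic: it pairs words found in only one corpus with similar-looking words found only in the other, so expect some false positives that need a manual check.

```python
import difflib
from collections import Counter

def vocabulary_stats(tokens):
    """Return (number of unique words, average word length).
    The average is taken over all tokens, so frequent words weigh more."""
    counts = Counter(tokens)
    average_length = sum(len(t) for t in tokens) / len(tokens)
    return len(counts), average_length

def longer_spellings(vocab_a, vocab_b, cutoff=0.85):
    """Return (word_a, word_b) pairs where word_a occurs only in vocabulary A,
    word_b occurs only in vocabulary B, the two look similar according to
    difflib, and word_a is the longer spelling."""
    only_a = sorted(vocab_a - vocab_b)
    only_b = sorted(vocab_b - vocab_a)
    pairs = []
    for word in only_a:
        # get_close_matches returns up to n candidates scoring >= cutoff
        for match in difflib.get_close_matches(word, only_b, n=3, cutoff=cutoff):
            if len(word) > len(match):
                pairs.append((word, match))
    return pairs

# Toy vocabularies standing in for the real corpora.
uk_vocab = {"colour", "behaviour", "the", "lift"}
us_vocab = {"color", "behavior", "the", "elevator"}

print(vocabulary_stats(["ab", "abcd"]))
print(longer_spellings(uk_vocab, us_vocab))
```

Calling longer_spellings(us_vocab, uk_vocab) answers the reverse question. On the full corpora this pairwise comparison can be slow, so it may help to restrict it to words above some minimum length or frequency first.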