Extracting and analyzing subtitles

Notes

I compiled all of this code on my computer, using my own system of organization. Most importantly, my corpus for each of the countries I was analyzing lived in a different folder, and my naming conventions relied heavily on the names of the countries, so many functions have to be run separately for each country. Since much of this was compiled ad hoc, the system isn't ideal, and most of the organization is hard-coded into the scripts themselves (which I know is a stylistic taboo, but oh well). I essentially had a bunch of different versions of my subtitles in various formats (original format, original format without timestamps, no newlines, all lowercase, etc.), organized in folders with different names ("full", "unzipped" (I hadn't realized I'd need a naming convention yet), "sansnewline", and "lowercase", respectively). These names usually go in the "source" argument of the functions in the code.

I can provide a diagram of my system if people want. That said, it shouldn't be too hard to switch those lines up to work with whatever system you use.

Also, I made use of pythonopensubtitles and NLTK, so both are necessary for many of the functions. In addition, I made frequent use of the readFile() and writeFile() code from CMU's 15-112 course documentation because I knew they worked.

import urllib
import os
 
def writeFile(filename, contents, mode="wt"):
    # wt stands for "write text"
    fout = None
    try:
        fout = open(filename, mode)
        fout.write(contents)
    finally:
        if (fout != None): fout.close()
    return True
 
def readFile(filename, mode="rt"):
    # rt stands for "read text"
    fin = contents = None
    try:
        fin = open(filename, mode)
        contents = fin.read()
    finally:
        if (fin != None): fin.close()
    return contents
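
These helpers are referenced by the later code blocks as a module called files (files.readFile(), files.writeFile()), so save them in a files.py alongside the other scripts.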

Since in this experiment we were trying to test the compressibility of American (US) vs. British (UK) English by using the subtitles of TV shows, I'll be talking a lot about episodes as my basic unit of information. This could also be done with movies or other subtitle-containing items, but some stuff (mostly naming conventions and elements of the subtitle downloader) would probably have to be changed.

Extracting Subtitles

Collecting Show Information

In order to perform analyses on the language contained in episodes of TV shows, we need a basic amount of information about those episodes. Also, pythonopensubtitles requires the IMDB ID to properly search for shows, and that's kind of a pain to isolate by hand. The following scraper queries the OMDb API (omdbapi.com), which very helpfully collects the relevant information from IMDB for us.

import json, os, urllib, time
 
def get_info_list(tv_list, hasID = False):
    """
    creates a dictionary that holds all the info about each show, keyed by title, with the imdb id easily accessible
    tv_list = list of tv shows. each item is a string in URL format, e.g. "black+books", with spaces and other non-alphanumeric characters replaced
    hasID = True if the list is made up of tuples in the format ("show+title+in+url+format", "imdbID"), False if it's just the titles.
    """
    imdb_info_list = { }
    for show in tv_list:
        #scrape the info off the internet using the omdbapi.com and add it to imdb_info_list
        time.sleep(1)
        if hasID: (title, imdb) = show
        else: title = show
        showinfo = scrape_tvshow(show, hasID) #create an entry in imdb_info_list
        imdb_info_list[title] = showinfo
        imdb_info_list[title]['filename'] = title
        #print show_list
        print 'added ' + title
    print 'DONE SCRAPING'
    return imdb_info_list
 
def scrape_tvshow(show, hasID = False):
    """
    creates an entry in the data structure for the specific show by searching it in the API and scraping the resulting information
    """
    if hasID:
        #if you're using the IMDB ID specifically to search
        (title, imdb) = show
        url = "http://www.omdbapi.com/?i=%s" % (imdb)
    else:
        #if you're just using the title to search
        url = "http://www.omdbapi.com/?t=%s" % (show)
    contents = read_webpage(url) #contents is now a string
    showinfo = json.loads(contents) #showinfo is a dictionary
    return showinfo
 
def read_webpage(url):
    """
    scrapes a webpage
    code taken from course documentation for CMU's 15-112
    """
    assert(url.startswith("http://"))
    fin = contents = None
    try:
        fin = urllib.urlopen(url)
        contents = fin.read()
    finally:
        if (fin != None): fin.close()
    return contents

When I compiled my lists of shows, I just built them manually and searched for more information using only the titles. However, since different pieces of audiovisual media can go by the same title, this sometimes resulted in the scraper pulling the wrong show. The best (and worst) way to find out about this is to just...assume things are correct unless you know about a possible conflict ahead of time, and wait for the downloader to throw an error.

After a first pass at trying to download subtitles, which will help you isolate shows that don't work, compile a new set of lists in the tuple format and run that through the scraper. I don't have code for adding the new dictionaries to the ones created by the first pass through get_info_list(), so I just did it manually.
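
For example, a hypothetical first and second pass might look like this (the show titles and the IMDB ID are just placeholders; dict.update() handles the merge that the code above doesn't do for you):

uk_shows = ["black+books", "the+it+crowd", "spaced"]
uk_info = get_info_list(uk_shows)
 
#second pass: shows whose title search pulled the wrong thing, with explicit IMDB IDs
uk_fixes = [("black+books", "tt0262150")]
uk_info.update(get_info_list(uk_fixes, hasID = True))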

Downloading Subtitles

The following functions are used to pull subtitles from http://www.opensubtitles.org/. Some of them refer to the previously defined scraper functions.

import os, time, urllib, json, xmlrpclib, zipfile
from pythonopensubtitles.opensubtitles import OpenSubtitles
opensub = OpenSubtitles()
 
def download_subs(info_list, country):
    """
    for each imdbid on a list, downloads subtitles from the internet
    """
    token = opensub.login('', '')
    assert type(token) == str
    for show in info_list:
        info = info_list[show]
        folder = 'subtitles/%ssubs/downloaded/%s' % (country, info['filename'])
        if not os.path.exists(folder):
            os.mkdir(folder)
        try:
            if int(info["Runtime"][0:2]) < 40: i = 3
            elif int(info["Runtime"][0:2]) >= 60: i = 1
            else: i = 2
        except: i = 2
        for x in xrange(i):
            num = x+1
            episode = num
            grab_sub(info, episode, country)
    print "DONE DOWNLOADING"
    return
 
def grab_sub(show, episode, country):
    path = "subtitles/%ssubs/downloaded/%s/%s1x0%d" % (country, show['filename'], show['filename'], episode)
    if os.path.exists(path):
        print '%s exists, moving on' % (path)
        return
    time.sleep(1)
    print 'downloading %s' % (path)
    XMLimdbID = show["imdbID"][2:9] #XMLlibrary only uses NUMBER part of imdbid
    data = opensub.search_subtitles([{'sublanguageid': 'eng', 'imdbid': XMLimdbID, 'season': '1', 'episode': str(episode)}])
    subinfo = data[0]
    downloadurl = subinfo['ZipDownloadLink']
    urllib.urlretrieve(downloadurl, path)
    assert zipfile.is_zipfile(path)
    print 'downloaded %s' % (path)
    return
 
def decompress_files(showinfo, country):
    #unzip each file in a folder
    directory = "subtitles/%ssubs/downloaded" % (country)
    dest = "subtitles/%ssubs/unzipped" % (country)
    for dirname in os.listdir(directory):
        if dirname not in showinfo: continue
        for filename in os.listdir(os.path.join(directory, dirname)):
            path = os.path.join(directory, dirname, filename)
            unzipfolder = os.path.join(dest, filename)
            if zipfile.is_zipfile(path):
                unzip(path, unzipfolder)
            else:
                print """
                ************
                NOT ZIP FILE: %s
                ************
                """ % (path)
    print 'DONE UNZIPPING'
    return
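
decompress_files() relies on an unzip() helper that isn't reproduced here. A minimal sketch using the standard zipfile module, assuming the destination folder may not exist yet:

def unzip(path, dest):
    #extract every member of the archive at path into the folder dest
    if not os.path.exists(dest):
        os.mkdir(dest)
    zf = zipfile.ZipFile(path)
    try:
        zf.extractall(dest)
    finally:
        zf.close()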

Downloading the subtitles is where things get interesting. The function requires you to provide it with a dictionary of information about each show, so it can determine how many episodes' worth of subtitles to download. Assuming the show can be successfully found, the subtitles will be downloaded as zipfiles to a "downloaded" folder that the function already expects to exist. Each show's episodes will be stored in a folder named for the show, which the function will create for you. The naming convention for each episode is "[showname][season#]x[episode#]", so the first episode of the first season of a show called "Black Books" would be "black+books1x01".

Most of the time, the downloader will fail for one of two reasons:

1: The scraper pulled the wrong show's info. If this is the case, get the IMDB ID from the IMDB page manually, and add the show to a tuple list to be run through the scraper again.

2: The first two episodes of the show weren't on opensubtitles.org. If there are other episodes of the same show, I just downloaded and decompressed them manually. If the show simply wasn't there, I found a replacement and added it to the tuple list.

Even though this process is mostly automated, it goes wrong often enough that it's better to just keep an eye on it. The code isn't very robust since I didn't know what I was doing at the time, so it could definitely be improved.

All the files will be downloaded as .zips, so once you've got all the ones you need, just make an "unzipped" folder and run decompress_files() to unzip them.
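
Put together, a hypothetical end-to-end run for one country (reusing the uk_info dictionary from the scraper sketch above, and assuming a country label like "UK" plus the pre-made "downloaded" and "unzipped" folders):

download_subs(uk_info, "UK")     #fills subtitles/UKsubs/downloaded/<show>/ with .zip files
decompress_files(uk_info, "UK")  #unzips each archive into subtitles/UKsubs/unzipped/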

Editing

Converting To .txt Files and Cleanup

After decompression, you will be left with a series of folders, one per episode. The folders (usually) contain at least one .srt file and at least one .nfo file. The .nfo files are basically useless to us, and we have to convert the .srt files to .txt before we can do anything with them. After converting the files to .txts, we still need to clean them up to get rid of all sorts of extraneous characters.

import os, shutil, re, files
 
def converttotxt(country):
    """
    convert a bunch of .srts to .txts so they can be edited, and move them to somewhere convenient
    country: string, country of the corpus to be converted to text
    """
    destfolder = "subtitles/%stxts/full" % (country)
    oldfolder = "subtitles/%ssubs/unzipped" % (country)
    if not os.path.exists(destfolder):
        os.mkdir(destfolder)
    for folder in os.listdir(oldfolder):
        #turn the .srt file into a .txt file, copy it and move it to a new place
        if folder[0] == ".": print folder, ", nope"; continue
        for filename in os.listdir(os.path.join(oldfolder, folder)):
            if filename[-3:] == "srt":
                src = os.path.join(oldfolder, folder, filename)
                print "converting %s to %s.txt" % (filename, folder)
                shutil.copyfile(src, os.path.join(destfolder, "%s.txt" % (folder)))
                break
    return
 
def cleanup(country, purgenewlines = False, lowercase = False):
    """
    remove extraneous characters/information from the subtitle files
    this got REALLY slow at one point and I'm not sure why, but it can be fixed by commenting stuff out of the loop
    """
    if purgenewlines: src = "subtitles/%stxts/sansnewline" % (country)
    elif lowercase: src = "subtitles/%stxts/lowercasenodict" % (country)
    else: src = "subtitles/%stxts/unzipped" % (country)
    timestamp = re.compile('^\d+\n?.*\n?', re.MULTILINE)
        #finds line numbers and the line after them (which usually houses timestamps)
    brackets = re.compile('\[[^]]*\]\n?|\([^)]*\)\n?|<[^>]*>\n?|\{[^}]*\}\n?')
        #finds brackets and anything in between them (sound effects)
    opensubs = re.compile('.*subtitles.*\n?|.*subs.*\n?', re.IGNORECASE)
        #finds the opensubtitles signature
    urls = re.compile('www.*\s\n?|[^\s]*\. ?com\n?')
        #finds any urls
    r = re.compile('\r')
        #gets rid of \r
    punctuation = re.compile("[^\w\s']")
        #finds punctuation
    n = re.compile('\n')
        #finds newlines
    for filename in os.listdir(src):
        print "fetching", filename
        content = files.readFile(os.path.join(src, filename))
        print "cleaning up", filename
        content = timestamp.sub('', content)
        content = brackets.sub('', content)
        content = opensubs.sub('', content)
        content = urls.sub('', content)
        content = r.sub('', content)
        content = punctuation.sub(' ', content)
        if purgenewlines: content = n.sub(' ', content)
        if lowercase: content = content.lower()
        files.writeFile(os.path.join(src, filename), content)
    print "DONE"
    return

converttotxt() will copy one .srt from every episode for a country, convert it to a .txt, and move the new file into that country's "full" folder. cleanup() will eliminate the timestamps, line numbers, urls, sound effects, and the opensubtitles signature. It also has options to get rid of newlines and convert all text to lowercase.

cleanup() does not create new copies of the files, so if you want sets of subtitles in different formats make sure to copy your folders and call the function on the right one.
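
For example, a hypothetical sequence for one country, using the folder names that appear in cleanup() (shutil.copytree() makes the per-format copies):

import shutil
 
converttotxt("UK")  #copies one .srt per episode to subtitles/UKtxts/full/ as a .txt
#copy "full" once per format before cleaning, since cleanup() edits files in place
shutil.copytree("subtitles/UKtxts/full", "subtitles/UKtxts/unzipped")
shutil.copytree("subtitles/UKtxts/full", "subtitles/UKtxts/sansnewline")
shutil.copytree("subtitles/UKtxts/full", "subtitles/UKtxts/lowercasenodict")
cleanup("UK")                        #cleans the "unzipped" copy
cleanup("UK", purgenewlines = True)  #cleans "sansnewline" and strips newlines
cleanup("UK", lowercase = True)      #cleans "lowercasenodict" and lowercases it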

For reference, un-cleaned up subtitles are in the following format:

   2
   00:00:08,967 --> 00:00:10,552
   Excuse me.
   There's a line, buddy.
   
   3
   00:00:10,719 --> 00:00:11,512
   There's two lines.
   
   4
   00:00:11,678 --> 00:00:14,389
   No, no, there's one line.
   We're in it.
   
   5
   00:00:14,556 --> 00:00:16,225
   I'm just getting a hot dog.
   
   6
   00:00:16,391 --> 00:00:17,643
   We're all getting hot dogs.

Making and Adding Dictionaries

When a file is zipped, the compressor only exploits repetitions within that particular file (rather than patterns shared between multiple files), so adding dictionaries of n-grams to the subtitle files greatly affects how compressible they are. The following functions are used in the assembly and concatenation of dictionaries based on country-specific corpora:

import numpy, files, os, json, pickle
import nltk
 
def makengram(country, n, source):
    """
    make a list of ngrams
    country: string, the country of the corpus that's being made into ngrams
    n: digit, the n in ngram
    source: string, the type of corpus. e.g. lowercase, sansnewline, etc
    """
    directory = "subtitles/%stxts/" % (country)
    ngrams = []
    for filename in os.listdir(directory+source):
        #split each file in the country-specific directory into ngrams
        if filename[0] == ".":
            #skip the files that are of no use to us
            print "skipping", filename
            continue
        print "running", filename
        content = files.readFile(directory+source+"/"+filename)
        #first we have to split each file into individual words
        words = nltk.word_tokenize(content)
        #then we can run the proper nltk function on the list of words to get the ngrams
        if n == 1:
            ngrams += words
        elif n == 2:
            ngrams += nltk.bigrams(words)
        elif n == 3:
            ngrams += nltk.trigrams(words)
        else:
            ngrams += nltk.util.ngrams(words, n)
    #save the list of ngrams in a text file that can easily be read by python later
    print "saving %s%s%s%dgrams.txt" % (directory, source, country, n)
    files.writeFile("%s%s%s%dgrams.txt" % (directory, source, country, n), (pickle.dumps(ngrams)))
    print "saved."
    return
 
def plaintext(ngrams, n):
    """
    turn a list of ngrams in format [["this", "is"], ["is", "ngrams"]] into a string in format "this is\nis ngrams"
    in other words, turn a list into plaintext easily readable by humans
    ngrams: list, list of ngrams to be converted
    n: the n in ngram (how many words are in each ngram)
    """
    topstr = ""
    for item in ngrams:
        if n == 1: topstr += item
        else:
            for string in item:
                topstr += string+" "
        topstr += "\n"
    return topstr
 
def makedict(country, n, lengths):
    """
    make plaintext dictionaries of the most frequent entries from a list of ngrams
    country: string, the country the corpora the dictionary comes from
    n: digit, the n in ngram
    lengths: list of digits, the lengths of the dictionaries to be assembled
    """
    print ("loading ngrams from subtitles/%stxts/lowercase%s%dgrams.txt" % (country, country , n))
    ngrams = pickle.loads(files.readFile("subtitles/%stxts/lowercase%s%dgrams.txt" % (country, country, n)))
    #find the frequencies of all the ngrams
    print "determining frequency..."
    fdist = nltk.FreqDist(ngrams)
    for length in lengths:
        #make a plaintext dictionary for each length in lengths
        top = fdist.keys()[:length]
        topstr = plaintext(top, n)
        #save each dictionary
        print ("saving to subtitles/%stxts/dictsizes/%dgram/lowercase%s%dgramdict%d.txt"
               % (country, n, country, n, length))
        files.writeFile("subtitles/%stxts/dictsizes/%dgram/lowercase%s%dgramdict%d.txt"
                        % (country, n, country, n, length), topstr)
    return
 
def concatenatefiles(country, dictsizes, xgram, source):
    """
    add dictionaries to each file in a folder, and save the concatenated files to a new location
    country: string, country of the corpus
    dictsizes: list of digits, the sizes of the dictionaries to be added
    xgram: digit, the n in ngram
    source: string, the type of corpus (e.g. lowercase)
    """
    for dictsize in dictsizes:
        ngrams = files.readFile("subtitles/%stxts/dictsizes/%dgram/%s%s%dgramdict%d.txt"
                                % (country, xgram, source if source=="lowercase" else "",
                                   country, xgram, dictsize))
        for filename in os.listdir("subtitles/%stxts/%s" % (country, source)):
            print "concatenating %s %dgram %d with %s" % (source, xgram, dictsize, filename)
            content = files.readFile(os.path.join("subtitles/%stxts/%s" % (country, source),filename))
            concatenated = ngrams + "\n\n" + content
            dest = ("subtitles/%stxts/dictsizes/%dgram/%sunzipped/%sdict%d"
                    % (country, xgram, source if source=="lowercase" else "",
                       country, dictsize))
            if not os.path.exists(dest):
                os.mkdir(dest)
            files.writeFile(os.path.join(dest, filename), concatenated)
    return

makengram() creates a list of ngrams, which is then saved into a text file using pickle. makedict() creates a plaintext dictionary of a user-specified length out of the previously saved ngrams. concatenatefiles() adds the full text of a dictionary to each file in a folder.
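
A hypothetical run for UK trigram dictionaries (the dictionary sizes are just examples, and the dictsizes/3gram folder tree is assumed to exist already, since writeFile() won't create it):

#pickle the full trigram list to subtitles/UKtxts/lowercaseUK3grams.txt
makengram("UK", 3, "lowercase")
#save plaintext dictionaries of the top 100/1000/10000 trigrams under dictsizes/3gram/
makedict("UK", 3, [100, 1000, 10000])
#prepend each dictionary to every file in the "lowercase" corpus
concatenatefiles("UK", [100, 1000, 10000], 3, "lowercase")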

Recompressing Files

After you're done getting text files in the format you want them in, with the dictionaries you want added on to them, you'll need to compress everything back into zipfiles. compress_files() will turn each file in a folder into its OWN zipfile (so a folder with 100 items in it would result in 100 zipfiles), while compress_folder() will compress each sub-folder of a directory into a single zipfile holding all of that sub-folder's files (so a sub-folder with 100 items would result in 1 .zip).

import os, zipfile
 
def compress_files(country, source):
    """
    compress each file in a folder into its own zipfile
    """
    directory = 'subtitles/%stxts/%s/' % (country, source)
    folders = os.listdir(directory)
    for folder in folders:
        dest = ('subtitles/%stxts/%szipped/%s'
                % (country, source if source!="unzipped" else "", folder[:-4]))
        path = directory+folder
        zf = zipfile.ZipFile(dest, "w", zipfile.ZIP_DEFLATED)
        print "zipping %s to %s" % (path, dest)
        zf.write(path, folder)
        zf.close()
    print "DONE ZIPPING"
    return
 
def compress_folder(directory, destfolder):
    """
    zip all files in a folder to ONE zipfile
    """
    folders = os.listdir(directory)
    for folder in folders:
        dest = os.path.join(destfolder,folder)+".zip"
        path = os.path.join(directory, folder)
        zf = zipfile.ZipFile(dest, "w", zipfile.ZIP_DEFLATED)
        abs_src = os.path.abspath(path)
        print "zipping %s to %s" % (path, dest)
        for dirname, subdirs, filenames in os.walk(path):
            #filenames (not "files") so we don't shadow the files helper module
            for filename in filenames:
                absname = os.path.abspath(os.path.join(dirname, filename))
                arcname = absname[len(abs_src) + 1:]
                zf.write(absname, arcname)
        zf.close()
    print "DONE ZIPPING"
    return
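
For instance, a hypothetical pass that zips the per-episode files and then the dictionary-concatenated corpora (paths follow the conventions in the code above; the destination folders need to exist first):

#one .zip per episode, written to subtitles/UKtxts/zipped/
compress_files("UK", "unzipped")
#one .zip per dictionary-sized corpus folder, for the dictionary-size analysis below
compress_folder("subtitles/UKtxts/dictsizes/3gram/lowercaseunzipped",
                "subtitles/UKtxts/dictsizes/3gram/lowercasezipped")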

Analysis

I wrote scripts for three different kinds of analyses: point-by-point analysis, where each episode represents a point of data; dictionary analysis, where the compression ratio for the ENTIRE corpus is recorded for files with a bunch of different-sized dictionaries added to them; and n-gram analysis, where we look specifically at the frequency of particular n-grams and their component parts.

Point by Point

Point by point analysis uses each episode as a point of data, and records a whole bunch of information about it.

import os, zipfile, json, numpy, re, nltk, files
 
def data_output(countries, source, dest):
    """
    output and save the information about each datapoint (so each episode) in a format suited for analysis and save a dictionary form because why the hell not
    countries: list of strings, countries that the data should be gathered from
    source: string, type of corpus
    dest: string, where the final, formatted file should be saved
    USER CALLS THIS FUNCTION, ALL THE REST ARE HELPERS
    """
    print "show\tseason\tepisode\tcountry\tzipsize\tunzipsize\tratio\tlines\twords per line\tWPL std dev\tunique words\tgenre\tyear"
    datadict = {}
    output = ""
    for country in countries:
        #output and save for each country
        countryoutput = ""
        directory = "subtitles/%stxts" % (country)
        #load previously assembled info from imdb about each show
        showinfo = json.loads((files.readFile("subtitles/%ssubs/%sinfo.txt" % (country, country))))
        for filename in os.listdir("subtitles/%stxts/zipped" % (country)):
            #collect information about each episode
            datapoint = {}
            getdata(filename, directory, datapoint, showinfo, source)
                #get zipsize, unzipsize, ratio
                #also country, genre, episode, season, showname, year (from showinfo)
            countryoutput += printdatapoint(datapoint)+"\n"
            datadict[filename] = datapoint
        output += countryoutput
        #save dictionary for specific country (I do not remember why I did this)
        name = "subtitles/%stxts/%sdata%s.txt" % (country, country, source)
        save_dict(datadict, name)
    #save output to specified filename/location
    files.writeFile(dest, output)
 
def getdata(filename, directory, datapoint, showinfo, source):
    """
    get information about the episode
    """
    show = filename[0:-4]
    datapoint["show"] = show
    datapoint["episode"] = filename[-2:]
    datapoint["season"] = filename[-4:-3]
    zipsource = os.path.join(directory, source+"zipped", filename)
    unzipsource = os.path.join(directory, source, filename+".txt")
    datapoint["zipsize"] = os.path.getsize(zipsource)
    datapoint["unzipsize"] = os.path.getsize(unzipsource)
    datapoint["ratio"] = float(datapoint["zipsize"])/datapoint["unzipsize"]
    lineinfo = get_lineinfo(directory, filename)
    datapoint["lines"] = lineinfo["lines"]
    datapoint["wpl"] = lineinfo["wpl"]
    datapoint["wpl st dev"] = lineinfo["wpl st dev"]
    datapoint["unique"] = lineinfo["unique"]
    datapoint["country"] = showinfo[show]["Country"]
    datapoint["genre"] = showinfo[show]["Genre"]
    datapoint["year"] = showinfo[show]["Year"][:4]
    return
 
def printdatapoint(datapoint):
    """
    print and return the information about an episode 
    """
    data = "%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s" % (
         datapoint["show"], datapoint["season"], datapoint["episode"],
         datapoint["country"], datapoint["zipsize"], datapoint["unzipsize"],
         datapoint["ratio"], datapoint["lines"], datapoint["wpl"],
         datapoint["wpl st dev"], datapoint["unique"], datapoint["genre"],
         datapoint["year"])
    print data
    return data
 
def get_lineinfo(directory, filename):
    """
    get a bunch of line-related information about a single file, put it in a dictionary
    """
    lineinfo = {}
    #load the file
    content = files.readFile(os.path.join(directory, "unzipped", filename+".txt"))
    #find the number of unique words in the file
    uniquewords = set(nltk.word_tokenize(content))
    lineinfo["unique"] = len(uniquewords)
    #find the number of lines
    lines = re.compile('.+\n?.*\n\n')
    lineslist = lines.findall(content)
    lineinfo["lines"] = len(lineslist)
    #find the average number of words per line
    linelengths = []
    for line in lineslist:
        linelength = len(nltk.word_tokenize(line))
        linelengths.append(linelength)
    lineinfo["wpl"] = numpy.mean(linelengths)
    lineinfo["wpl st dev"] = numpy.std(linelengths)
    return lineinfo
 
def save_dict(dictionary, filename):
    """
    save the dictionary of show information
    """
    contents = json.dumps(dictionary, indent=4, sort_keys=True)
    print "saving info to "+filename
    files.writeFile(filename, contents)
    return

This will output and save a table full of information about each episode, including genre, compression ratio, country, and year of production.

Dictionary Size

Dictionary size analysis computes the compression ratio of entire corpora concatenated with n-gram dictionaries of different sizes and zipped (compress_folder()). I did a lot of manual file shuffling to get stuff organized for this, because by the time we were doing dictionary analysis my filing system was a complete disaster.

import os, files
 
def dictsizes(ngram, countries, case):
    """
    find the compression information about entire corpuses concatenated with dictionaries of different sizes
    ngram: digit, the n in ngram
    countries: list of strings, the countries to be analyzed
    case: the type of dictionary and corpus (e.g. lowercase)
    """
    table = "country\tdictsize\tzipsize\tunzipsize\tratio\n"
    for country in countries:
        #get the compressed size of the directory with the corpus in it, the uncompressed size, and the compression ratio for each dictionary size
        directory = "subtitles/%stxts/dictsizes/%dgram" % (country, ngram)
        for filename in os.listdir(os.path.join(directory, "%szipped" % (case))):
            #get the zipsize, the unzipsize, and the ratio of the two
            dictsize = filename[6:-4]
            country = filename[:2].upper()
            zipsize = os.path.getsize(os.path.join(directory, "%szipped" % (case),
                                                   filename))
            unzipsize = 0
            for item in os.listdir(os.path.join(directory, "%sunzipped" % (case), filename[:-4])):
                #calculate the unzipsize by adding the sizes of all the files in the directory
                unzipsize += os.path.getsize(os.path.join(directory, "%sunzipped" % (case), filename[:-4], item))
            ratio = float(zipsize)/unzipsize
            #add the information to a table in a format suitable for analysis
            table += "%s\t%s\t%s\t%s\t%s\n" % (country, dictsize, zipsize, unzipsize, ratio)
    print table
    print "saving table to subtitles/dictsizes%s%dgram.txt" % (case, ngram)
    #save the table
    files.writeFile("subtitles/dictsizes%s%dgram.txt" % (case, ngram), table)
    return
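
For example, assuming lowercase trigram dictionaries were built and zipped as above:

dictsizes(3, ["UK", "US"], "lowercase")
#prints the table and saves it to subtitles/dictsizeslowercase3gram.txt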

N-Grams

Finally, I also wrote some stuff to do analysis on the n-grams themselves. Using pickled trigram files, this script will find the frequency of each trigram in the list, the frequency of each bigram in each trigram, and the frequency of each word in each trigram.

import os, files, pickle, json
import nltk
 
def get_ngram_freq(country):
    """
    analyze the frequency of each bigram and word in each trigram in a file
    first, load all the files and determine the frequency of each ngram in them
    """
    folder = "subtitles/%stxts/" % (country)
    #trigrams
    print "loading trigrams from %slowercase%s3grams.txt" % (folder, country)
    trigrams = pickle.loads(files.readFile("%slowercase%s3grams.txt" % (folder, country)))
    print "getting trigram frequency..."
    trigramfreq = nltk.FreqDist(trigrams)
    #bigrams
    print "loading bigrams from %slowercase%s2grams.txt" % (folder, country)
    bigrams = pickle.loads(files.readFile("%slowercase%s2grams.txt" % (folder, country)))
    print "getting bigram frequency..."
    bigramfreq = nltk.FreqDist(bigrams)
    #words
    print "loading words from %slowercase%s1grams.txt" % (folder, country)
    words = pickle.loads(files.readFile("%slowercase%s1grams.txt" % (folder, country)))
    print "getting word frequency..."
    wordfreq = nltk.FreqDist(words)
    frequency = {}
    print "gathering frequency data for ALL trigrams:"
    for trigram in trigramfreq.keys():
        #find the frequency of each word, each bigram, and the trigram itself
        print "analyzing", trigram
        tridict = {}
        tridict["name"] = trigram
        counter = 0
        for word in trigram:
            counter += 1
            tridict["word%d" % (counter)] = wordfreq[word]
        tridict["bigram1"] = bigramfreq[trigram[:2]]
        tridict["bigram2"] = bigramfreq[trigram[1:]]
        tridict["trigram"] = trigramfreq[trigram]
        frequency[trigram] = tridict
    #save the information in a text file so that it can be easily read by python later
    print "saving frequency data to %s%sfreqdata.txt" % (folder, country)
    files.writeFile("%s%sfreqdata" % (folder, country), pickle.dumps(frequency))
    print "DONE"
    return
 
def freq_data_output(country):
    """
    output frequency data (compiled using get_ngram_freq()) in a format suited for analysis
    """
    freq = pickle.loads(files.readFile("subtitles/%stxts/%sfreqdata" % (country, country)))
    print "word1\tword2\tword3\tfreqOfTriGram\tfreqWord1\tfreqWord2\tfreqWord3\tfreqBiGram1-2\tfreqBiGram2-3"
    fullstr = ""
    for trigram in freq:
        item = freq[trigram]
        line = "%s\t%s\t%s\t%d\t%d\t%d\t%d\t%d\t%d\n" % (
              item["name"][0], item["name"][1], item["name"][2], item["trigram"],
              item["word1"], item["word2"], item["word3"], item["bigram1"],
              item["bigram2"])
        print line
        fullstr += line
    print "saving to subtitles/%stxts/%sfreqdataforanalysis.txt" % (country, country)
    files.writeFile("subtitles/%stxts/%sfreqdataforanalysis.txt" % (country, country), fullstr)
    print "DONE"
    return
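
As with the rest of the pipeline, a hypothetical invocation for one country:

get_ngram_freq("UK")    #pickles the per-trigram frequency data to subtitles/UKtxts/UKfreqdata
freq_data_output("UK")  #writes the tab-separated table to subtitles/UKtxts/UKfreqdataforanalysis.txt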