The Analyze Transcripts tool

Updated 2024-03-03, for Game Server 6.026

The Analyze Transcripts tool (scripts/analyze-transcripts.sh; the underlying Java class is edu.wisc.game.tools.AnalyzeTranscripts) can be used to extract transcripts for specified experiment plans, specified players, or specified repeat users, and to carry out certain types of data analysis.

When processing transcripts, the analyzer removes duplicate entries (which are sometimes written by the server) and entries with code 3 "empty cell", because our researchers become sad when they see players grabbing at empty cells (i.e. already-removed pieces).
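The clean-up step described above can be sketched as follows. This is only an illustration in Python, not the tool's actual Java code. It assumes that each transcript entry is a dict with a code field, and that "duplicate" means an exact repeat of an earlier entry; the real analyzer's duplicate test may differ. The function name and the field layout are hypothetical.

```python
EMPTY_CELL_CODE = 3  # code 3 "empty cell", as named in the text above


def clean_transcript(rows):
    """Return rows without exact duplicates and without empty-cell entries.

    Assumption of this sketch: a duplicate is a row identical, field by
    field, to a row already seen.
    """
    seen = set()
    cleaned = []
    for row in rows:
        if int(row["code"]) == EMPTY_CELL_CODE:
            continue  # the player grabbed at an empty cell
        key = tuple(sorted(row.items()))
        if key in seen:
            continue  # same entry written twice by the server
        seen.add(key)
        cleaned.append(row)
    return cleaned


# Made-up entries: one duplicate and one empty-cell attempt
rows = [
    {"moveNo": "0", "y": "1", "x": "2", "code": "0"},
    {"moveNo": "0", "y": "1", "x": "2", "code": "0"},
    {"moveNo": "1", "y": "1", "x": "2", "code": "3"},
    {"moveNo": "2", "y": "4", "x": "5", "code": "4"},
]
print(clean_transcript(rows))
```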

Do you have this script? And where?

Depending on what computer you use, the way to obtain and access this script (and other data analysis scripts) may be a bit different. See Script location for details.

Usage - overview

In general, the usage of the script can be described as follows:

      /home/vmenkov/w2020/game/scripts/analyze-transcripts.sh [options] data_selector
where data_selector specifies the set of transcripts to extract, and options specify the mode of processing.

If running the script on downloaded data

You may run the analysis scripts not on data organically produced by the Game Server on the local host, but on data pulled from a remote server. (E.g. you run the analysis script on the CAE host, looking at the data that had been accumulated on a Plesk host). In this case, when you ran the pull script, it must have told you the name of the config file it created, e.g. w2020_game_wwwtest_rulegame_2024_01_30.conf, which describes the location of the downloaded data (the file directory location, and the database name). When running the analysis script on these data later, make sure to pass the name of the config file to the analysis script, with the -config option, e.g.

 ... -config w2020_game_wwwtest_rulegame_2024_01_30.conf  ...

Data selection

There are several data selection modes, illustrated below. To save space, the script name in the examples below is given without the full path, which in reality you will need to indicate (unless you add the directory where the script resides to your PATH). The output goes to the output directory, which by default is named tmp; you can also specify its location using the -out option (see below). If the output directory does not exist, the tool will create it.

1. Select by experiment plan name

      
analyze-transcripts.sh [options] plan1 [plan2...]
  
e.g.
      
analyze-transcripts.sh pilot06
  
or
      
analyze-transcripts.sh '%OCT26%'
  

(You can also write analyze-transcripts.sh -plan '%OCT26%', but the -plan option is unnecessary, since selection by plan name is the default selection method).

As the above example shows, the percent character can be used for wildcard-based selection, to select all matching plans. The percent sign can also be used in all other selectors discussed below. (We use % rather than * because that's what is used in the SQL server. It also avoids certain problems with command line processing).
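To illustrate what the % wildcard matches, here is a small Python sketch that mimics it with a regular expression. It is an approximation only: the tool itself relies on the SQL server's matching, which may also differ in case sensitivity and in the handling of other special characters. The function names and the plan names in the example are hypothetical.

```python
import re


def percent_pattern(pattern):
    """Compile a pattern in which '%' stands for any run of characters."""
    literal_parts = [re.escape(part) for part in pattern.split("%")]
    return re.compile(".*".join(literal_parts), re.DOTALL)


def select(names, pattern):
    """Return the names that match the whole pattern."""
    rx = percent_pattern(pattern)
    return [name for name in names if rx.fullmatch(name)]


# Made-up plan names
plans = ["pilot06", "test_OCT26_a", "OCT2"]
print(select(plans, "%OCT26%"))
print(select(plans, "pilot06"))
```

A pattern without any % sign matches only the exact name.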

Specifying an experiment plan is equivalent to specifying a list of player IDs (see 2 below) that includes all player IDs assigned to that experiment plan.

2. Select by player IDs

In the Rule Game Server, a player ID corresponds to a round of one or several series of episodes played according to a particular trial list. An M-Turker has a single player id, while a repeat user may have multiple ones.

      
analyze-transcripts.sh -pid playerId1 [playerId2 ...]
  
E.g.
      
analyze-transcripts.sh -pid 'pk%'
  
or
      
analyze-transcripts.sh -pid 'RepeatUser-%'
  
(The second example picks all player IDs created by GS4's repeat users, since that's what their player IDs look like).

3. Select by repeat user ID

Internally, repeat users are identified by unique user IDs. (You can see IDs for various players e.g. by logging in to the SQL server with mysql and issuing commands such as

use game;
select * from User;
  
)

One can pass a list of user IDs to the Analyze Transcript tool as well, as follows:

      
analyze-transcripts.sh -uid uid1 [uid2 ...]
  
E.g.
      
analyze-transcripts.sh -uid 2 3 6
  

Selecting data by specifying a repeat user ID is equivalent to selecting by player ID and listing all player IDs associated with that repeat user.

4. Select by repeat user nickname

A repeat user created via the login screen may have a nickname. You can see the nicknames of repeat users with the same SQL command. The Analyze Transcript tool can take a list of user nicknames instead of a list of user IDs, e.g.

      
analyze-transcripts.sh -nickname 'John%Doe' '%Walrus%'
  

Note that one should not use spaces inside command-line arguments. If your nickname has a space inside it, you may use a percent sign (%) on the command line above instead.

Combining different selectors

What happens if your command has several different selection options, e.g. by plan and by player ID, or by player ID and by user ID? As of ver. 6.026 this works as follows.

Example: the selector

 -plan 'ep/rule_ambiguity/ambiguity%'  -pid 'A%'
will select the players whose player IDs start with an A (i.e. mostly M-Turkers) and who played one of the plans with names starting with ep/rule_ambiguity/ambiguity.

Options controlling input and output

If you have a large list of arguments (e.g. player IDs), you can put them in a CSV file (one item per line, to be read from the first column of the file; all other columns are ignored), and use the -file option on the command line. E.g.

      
analyze-transcripts.sh -pid -file my-pid-list.csv
  

This is handy if, for example, you have a list of "good" players prepared for you separately by another team.
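If you want to check what such a list file contains before passing it to the tool, the "first column only" rule can be imitated in a few lines of Python. This is a sketch under assumptions: whether the tool skips a header line or blank lines is not stated here, so this version simply skips empty first cells and nothing else. The function name and the sample IDs are made up.

```python
import csv
import io


def read_first_column(lines):
    """Return the non-empty values of the first CSV column, in file order."""
    values = []
    for row in csv.reader(lines):
        if row and row[0].strip():
            values.append(row[0].strip())
    return values


# Made-up file content: player IDs in column 1, notes in column 2
sample = io.StringIO("A1EXAMPLE,good\nA2EXAMPLE,\n\nA3EXAMPLE,checked twice\n")
print(read_first_column(sample))
```

With a real file you would pass an open file object instead of the StringIO, e.g. open("my-pid-list.csv", newline="").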

Normally, the Transcript Analyzer reads saved data (transcripts etc) from /opt/tomcat/saved. However, this can be changed with the option -in input_dir_name. If you do that, make sure that you are still connecting to the right MySQL server, i.e. the data in the SQL server are consistent with the data in the input directory.

You can use the option -out directory_name to have the output written to a directory other than the default tmp.

Options controlling data processing

The Transcript Analyzer identifies all episodes played by specified players (or the players in the specified experiment plan), and writes out a separate transcript section file for each (player, rule set) combination (i.e. for each series played by each player). Additional computations are controlled by the options that appear on the command line before the selection specifiers.

1. If the -pre option is given, an additional column named precedingRules is produced in the split transcript files. The value in this column describes the player's experience prior to starting playing with the current rule set. Specifically, this column contains the list of the rule sets that preceded the current rule set in the trial list, in the chronological order, separated by semicolons. Naturally, if the current rule set is the first rule set in the trial list, the precedingRules column will be empty.

2. Curve fitting. If the -nofit option is given, no additional computations are carried out.

Otherwise, log-like curve-fitting is carried out, either with no p0, or with p0, computed based on a particular baseline player model. Currently, two such models are supported: the Completely Random Player and MCP1 (Minimally Competent Player 1), both described below.

The value of p0 for each move attempt is written as a last column of each transcript section file, when applicable.

The curve fitting parameters are written into a file named summary-flat.csv, summary-p0-COMPLETELY_RANDOM.csv, or summary-p0-MCP.csv, as the case may be.

3. Board positions. If the -boards option is supplied, the analyzer saves the board position before each move. It is written into the same CSV output file, in a column named "board" that follows the column "p0". The board description is saved in a JSON format, similar to what is used in the client-server communication (as illustrated in the Game API document, under /display).

This option can only be used with the -p0... option.
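As an illustration of the precedingRules column produced by the -pre option (item 1 above), the following Python sketch builds its value for every rule set in a trial list. The function name and the rule set names are hypothetical; only the rule itself (earlier rule sets, in chronological order, joined by semicolons) comes from the description above.

```python
def preceding_rules(rule_sets):
    """Pair each rule set of a trial list with its precedingRules value."""
    return [
        (name, ";".join(rule_sets[:i]))
        for i, name in enumerate(rule_sets)
    ]


# Made-up trial list with three rule sets
for name, preceding in preceding_rules(["ruleA", "ruleB", "ruleC"]):
    print(name, repr(preceding))
```

The first rule set gets an empty string, matching the empty precedingRules column described above.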

The computation of p0

Computing the p0 in a sensible way is somewhat complicated by the fact that, when we see a "failed pick" recorded in the transcript, we do not know what the player's true intention was: did he hope to move a game piece into a bucket (but could not, because the piece wasn't movable), or did he just want to find if the piece is movable (with the intent to just drop it even if it was)?
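
What the player wanted to do             | How the system responded                                               | What was recorded
-----------------------------------------+------------------------------------------------------------------------+------------------
Find out if a piece was movable          | Can't be moved                                                         | Failed pick
Find out if a piece was movable          | Can be moved (and then the player drops it in the middle of the board) | Successful pick
Move the piece to a bucket of his choice | Can't be moved                                                         | Failed pick
Move the piece to a bucket of his choice | Can be moved, but not to the chosen bucket                             | Failed move
Move the piece to a bucket of his choice | Can be moved to the chosen bucket                                      | Successful move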

If the second intention (just finding out whether a piece is movable) were common, then we would also see a lot of successful picks in our transcripts, and we don't. Therefore, we compute p0 for failed picks based on the first assumption, i.e. give them the same value as for moves (successful and failed).

There are still some successful picks in the record. I am inclined to explain them primarily by "slips of the fingers" (it's not that easy to use a mouse or a touchscreen device error-free, so in some cases a player can accidentally drop a piece on the board before it's brought to a bucket). Nonetheless, for them we compute p0 as if they, indeed, resulted from an intent to just try a pick.

The Completely Random Player

A completely random player has no memory whatsoever of his actions. Every time, it randomly chooses a piece, and a destination bucket, and makes an attempt to move the chosen piece to the chosen bucket. Accordingly, the value of p0 for all successful and failed move attempts, and for failed pick attempts, is computed as p0 = B/(4*m), where m is the number of pieces on the board, and B is the number of possible moves (i.e., of allowed-move pairs (origin,destination)). For successful picks, we use p0 = M/m, where M is the number of movable pieces currently on the board.
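The two formulas above can be written down directly. The Python function below is only a worked illustration of the arithmetic, not the analyzer's code; the function name, its arguments, and the board numbers in the example are made up.

```python
def p0_random(m, B, M, successful_pick=False):
    """p0 for the Completely Random Player.

    m: pieces on the board; B: allowed (piece, bucket) pairs;
    M: movable pieces. Successful picks use M/m, everything else B/(4*m).
    """
    if m <= 0:
        raise ValueError("the board must contain at least one piece")
    if successful_pick:
        return M / m
    return B / (4 * m)


# Made-up board: 5 pieces, 3 of them movable, 6 allowed moves in total
print(p0_random(m=5, B=6, M=3))                        # 6 / 20
print(p0_random(m=5, B=6, M=3, successful_pick=True))  # 3 / 5
```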

MCP1 (Minimally Competent Player 1)

This models a player who does not repeat failed move/pick attempts on the same unchanged board. That is, if he has a failed pick for piece A, he won't make another attempt to pick or move A. If he has a failed move attempt for piece A to bucket b, then he won't try to move A to b again, but he still may try to move A to other buckets. However, MCP1 does not make any "positive inferences", along the lines of, "if I have had a failed move attempt for piece A to bucket b, this must mean that one of the other 3 destinations for A is allowed", or "if I have had a successful pick attempt for A, it means that A has at least one allowed destination". Accordingly, the p0 for MCP1 are computed as follows:

p0 = B/(4*m - F), for failed and successful moves and for failed picks
p0 = M/(m-f), for successful picks.

Here, F is the number of "known impossible moves", which includes all (piece,bucket) pairs which can be inferred from a failed move or from a failed pick. (A failed pick, of course, results in 4 such known impossible pairs). Similarly, f is the number of "known immovable pieces", which counts pieces for which the player has had a failed pick in the current position, or 4 failed moves (trying all 4 buckets) in the current position.

The values of F and f reset to 0 once a piece is successfully removed from the board and the board position changes. In other words, an MCP does not retain any memory of his actions from one position to the next.
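The bookkeeping for F and f can be sketched as below. This is a minimal illustration in Python, not the analyzer's implementation: the class name, method names, piece labels, and bucket numbering (0 to 3) are all assumptions. It relies on the observation from the text that a failed pick yields 4 known impossible pairs, so a "known immovable piece" is simply a piece with all 4 pairs known impossible.

```python
class Mcp1Memory:
    """What MCP1 remembers about the current, unchanged board position."""

    N_BUCKETS = 4

    def __init__(self):
        self.impossible = set()  # known impossible (piece, bucket) pairs

    def failed_move(self, piece, bucket):
        self.impossible.add((piece, bucket))

    def failed_pick(self, piece):
        # A failed pick rules out all 4 destinations for this piece
        for bucket in range(self.N_BUCKETS):
            self.impossible.add((piece, bucket))

    def reset(self):
        """Call when a piece is removed and the board position changes."""
        self.impossible.clear()

    @property
    def F(self):
        return len(self.impossible)

    @property
    def f(self):
        per_piece = {}
        for piece, _bucket in self.impossible:
            per_piece[piece] = per_piece.get(piece, 0) + 1
        return sum(1 for n in per_piece.values() if n == self.N_BUCKETS)

    def p0(self, m, B, M, successful_pick=False):
        if successful_pick:
            return M / (m - self.f)
        return B / (4 * m - self.F)


# Made-up position: 5 pieces, 3 movable, 6 allowed moves
mem = Mcp1Memory()
mem.failed_move("A", 0)   # F becomes 1
mem.failed_pick("B")      # F becomes 5, f becomes 1
print(mem.F, mem.f, mem.p0(m=5, B=6, M=3))
```

Using a set of pairs means that repeating the same failed attempt does not inflate F.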

Output format

The Analyze Transcripts tool produces multiple output files. It creates a directory tree, with a directory for every rule set involved. Each directory contains a number of CSV files, one file for each player who played that rule set.

Each CSV file, with 1 line per move, contains at least the following columns:

ruleSetName,playerId,experimentPlan,trialListId,seriesNo,orderInSeries,episodeId,moveNo,timestamp,y,x,by,bx,code
    
The first 7 columns describe the episode (who played it, under which rule set, and where that fits in that player's overall history). The subsequent columns describe an individual move.

Depending on options, this may be followed by some of the following additional columns:

p0,board,precedingRules

Post processing: "try again" numbers

As per Paul's request (Oct 2022) we have a post-processing script, which can be applied to split transcript files (computed with the -pre option) to compute a measure of the player's proclivity to try moving the same game piece.

The usage is

      ~vmenkov/w2020/game/scripts/does-try-again.pl -out=outputFile.csv files_or_directories
    

The command-line argument(s) are CSV files (*.split-transcripts.csv) produced by the Analyze Transcript tool, or directories containing such files. If directories appear among the arguments, each directory will be processed recursively.

The -out=... option specifies the name of the output file to be created. If omitted, the file will be named out.csv.

Example:

	~vmenkov/w2020/game/scripts/does-try-again.pl .
      
This will recursively process all split transcript files in the current directory, and write file out.csv.

For more details, see Example 2 below.

The analysis in this script categorizes all actions (other than the last one, which is not followed by another action) into 3 groups: [doesTryAgain], [doesNotTryAgain], and [other].

The [other] group includes successful moves and failed picks, because both of them ought to be followed (in the former case, physically; in the latter, logically) by an attempt at some other piece. All other actions are categorized into [doesNotTryAgain] or [doesTryAgain] based on whether the subsequent action involved the same piece as this action.
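The categorization rule can be illustrated with the following Python sketch. It is not the Perl script itself. It assumes that each action has already been reduced to a (piece, kind, success) triple, where kind is "move" or "pick"; how does-try-again.pl identifies the piece and the kind of action from the CSV columns, and over what span of actions it applies the rule, is not shown here. All names and the sample data are hypothetical.

```python
def classify(actions):
    """Label every action except the last one.

    actions: list of (piece, kind, success) triples in chronological order.
    """
    labels = []
    for current, following in zip(actions, actions[1:]):
        piece, kind, success = current
        successful_move = kind == "move" and success
        failed_pick = kind == "pick" and not success
        if successful_move or failed_pick:
            labels.append("other")
        elif following[0] == piece:
            labels.append("doesTryAgain")
        else:
            labels.append("doesNotTryAgain")
    return labels


# Made-up sequence of six actions on pieces A, B, C
actions = [
    ("A", "move", False),
    ("A", "move", False),
    ("B", "pick", False),
    ("B", "pick", True),
    ("C", "move", True),
    ("C", "move", True),
]
print(classify(actions))
```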

Examples

Example 1

(Here, we assume that we run the script in /home/vmenkov. If running in your own directory, adjust the paths as appropriate).

Let's say that we have a list of "interesting" playerIDs in the file w2020/slack/pid.csv. (This is based on what Gary's team sent in late October 2021). We just want to extract and split the transcript data for these players, without curve fitting.

w2020/game/scripts/analyze-transcripts.sh -nofit -pid -file w2020/slack/pid.csv  -out tmp-3-flat >& tmp.log
w2020/game/scripts/analyze-transcripts.sh -p0 random -nofit -pid -file w2020/slack/pid.csv  -out tmp-3-random >& tmp.log
w2020/game/scripts/analyze-transcripts.sh -p0 mcp1 -nofit -pid -file w2020/slack/pid.csv  -out tmp-3-mcp1 >& tmp.log

The output files in the 3 directories, tmp-3-flat, tmp-3-random, tmp-3-mcp1, contain the split transcripts in 3 versions (only differing in the last column, p0). The first has no p0, the 2nd and 3rd have p0 given by the completely random model and the MCP1 model, respectively.

Each line in those files represents one move (or pick) attempt. The code column of the output files contains 0 for a successful move (or pick) attempt, and a non-zero (usually 3) for a failed attempt. The p0 column (where applicable) contains the value of p0 (for the relevant baseline model) before each attempt.
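For a quick look at one of these files, the code column can be tallied with a short Python sketch like the one below. It assumes that the file starts with a header line carrying the column names listed under "Output format"; the function name and the sample rows (including the timestamps and the non-zero code value) are made up for illustration.

```python
import csv
import io


def count_outcomes(csv_text):
    """Return (successful, failed) attempt counts based on the code column."""
    successful = failed = 0
    for row in csv.DictReader(io.StringIO(csv_text)):
        if int(row["code"]) == 0:
            successful += 1
        else:
            failed += 1  # any non-zero code counts as a failed attempt
    return successful, failed


# Made-up content of a split transcript file (header plus 3 attempts)
SAMPLE = (
    "ruleSetName,playerId,experimentPlan,trialListId,seriesNo,"
    "orderInSeries,episodeId,moveNo,timestamp,y,x,by,bx,code\n"
    "ruleA,A1EXAMPLE,planX,listY,0,0,ep-001,0,t0,1,2,0,7,4\n"
    "ruleA,A1EXAMPLE,planX,listY,0,0,ep-001,1,t1,1,2,7,7,0\n"
    "ruleA,A1EXAMPLE,planX,listY,0,0,ep-001,2,t2,3,3,0,0,0\n"
)
print(count_outcomes(SAMPLE))
```

For a real file, read its content first, e.g. with open(path, newline="") and .read().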

With a large real-life sample, it's likely that the log file will contain some error messages. E.g. the following message is due to the fact that a particular experiment plan no longer exists:

ERROR: Skipping player=A3UKRC3LBM1LEW due to missing data. The problem is as follows:
java.io.IOException: No experiment plan directory exists: /opt/tomcat/game-data/trial-lists/pilot05_1

The following message indicates that a particular plan no longer has any trial list files in its directory:

ERROR: for player A1FKRZKU1H9YFC, no trial list is available for id=nameability_high_first any more

The following message indicates that a rule set file has been (re)moved, and no longer exists at the location specified in the trial list file that was used during the particular player's play.

ERROR: Cannot process data for player=APGX2WZ59OWDN due to missing data. The problem is as follows:
java.io.IOException: Cannot read rule file: /opt/tomcat/game-data/rules/Rule-002.txt

This particular message is only relevant in runs with -p0 on, since we need to know the rules to compute p0. Thus it's likely that there will be fewer split transcripts in the analyses with p0 than in those without p0.

Such phenomena usually result from the research team occasionally rearranging the experiment control files. Still, despite such defects, you will have plenty of remaining good data to work with.

Example 2

Make a list of player IDs of all M-Turk players:

    cd /opt/tomcat/saved/transcripts
    ls A*.csv | perl -pe 's/\.transcripts\.csv$//' > ~/list.tmp
  

Produce split transcripts for all of these players (1511, as of Oct 2022)

      cd
      ~vmenkov/w2020/game/scripts/analyze-transcripts.sh -pre -pid -file list.tmp > out.log
    
The process will take a few minutes. A few errors will be reported, as some of those players played experiment plans that no longer exist, etc. It produces over 4500 split transcripts:
     find tmp -type f -name '*transcripts.csv'|wc
   4513    4513  259944

For each player and "experience" compute the doesTryAgain/doesNotTryAgain numbers:

    cd tmp
    ~vmenkov/w2020/game/scripts/does-try-again.pl -out=again.csv .
     wc again.csv 
  4514   4514 362094 again.csv
    cd
    mv tmp tmp.again-1
    
