A typical analysis using Structurama has the following steps: (1) Data entry into a file; (2) Reading the data into computer memory; (3) Setting the model for the analysis; (4) Running the Markov chain Monte Carlo analysis; and (5) Summarizing the results of the analysis. Steps 2 through 5, above, use the commands "execute", "model", "mcmc", and "showmeanpart/showtogetherness/shownumpops", respectively. The program provides built-in help. You can type "help" or "help <command_name>" to get a list of options or information on a particular command.
1. Getting started
If you have not already done so, you should download the program. You should download an executable (the Macintosh) or the source code. If you are running linux (or some other unix-type operating system, you should compile the program using the Makefile that is included. Simply type "make". and the example data files. You can download the program here.
2. Data format
Structurama has a unique, NEXUS-like, file format. You should format your data in a text file. You can use any of a number of good text-editing programs, including Microsoft Word (make certain that you save the file as a text file) or BBEdit. You can use the example data files as a model for your own. The file format, however, is quite simple. The data are entered in a data block (a data block starts with "begin data;" and ends with "end;"). You can also enter commands into the data file. If you do this, the commands should be in a structurama block (a structurama block starts with "begin structurama;" and ends with "end;").
The data block contains your observations, which are assumed to be alleles at different loci. As an example, consider this simple data file:
Here, there are three individuals (Larry, Moe, and Curly) and four loci. The alleles observed at the first locus for Larry are 1 and 1; that is, Larry is homozygous at the first locus, with both alleles arbitrarily labeled "1". Larry is heterozygous at the second and fourth loci and homozygous at the others. If information on an allele is missing, you should enter a question mark (?) for that allele. You can see that we are uncertain about one of the alleles at the third locus for Curly.
If your loci are haploid, then you should only have one allele entered for each locus (e.g., "( 3 )");
As mentioned above, you can also enter commands directly into the data file. For example, the following instructs the program to set the model so that there are two populations and then runs a Markov chain Monte Carlo analysis for 10,000 cycles.
You can enter multiple data blocks into the same file. This allows you to perform multiple analyses automatically, which might be useful for a simulation study. You can also enter comments--bits of text that the program skips--into the file. Comments are contained in between square brackets (e.g., "[This is a comment.]").
You should be careful with your use of tabs in the file. We recommend that you not enter tabs. Although the program Structurama does not care, you might consider making the file easy to read and format it nicely. Because you can embed commands and comments directly into the file, the data file can serve as a sort of record of any analyses you have performed. If a reader of one of your papers asks you for your data so that he/she can replicate your analysis, you can simply send the text file used in your original analysis; it will contain not only the data you used, but if properly formatted, a record of your analyses.
You should also be careful that your file is saved with the correct line endings. It turns out that Unix, Macintosh, and DOS all use different line endings in text files. In a program like BBEdit, you can save the file with different line endings. This usually shouldn't be a problem. However, with Macintosh OSX, you can use Structurama either from the terminal or as a "double clickable" application. If you use the program from the terminal, then you need to make certain your files have Unix line endings. Similarly, if you use the OSX application version of the program (the version with the nice icon), then you need to make certain to save the file with Macintosh line endings.
3. Execute the data file
Once you have entered you data into a text file, you need to read the data into computer memory. After starting the program Structurama, type "execute <file_path_and_name>", where <file_path_and_name> are the path (directory) and name of the text file containing the data. If the data file is in the same directory as the Structurama application, then you can omit the path to the file. For example, I have a file named "mullen.in" in a directory on my computer called "structurama". I can execute this file by typing:
when using the command-line version of the program. Finally, because the stucturama program is in the same directory as the data file called "mullen.in", I can simply type
or, because I am lazy, I can type
Structurama supports the shortest unambiguous spelling of a command. Because "execute" is the only command that starts with "e", I can simply type "e" and the program interprets this as "execute". I save six key strokes by using the shortest unambiguous spelling of execute!
4. Set the model
Once the data has been read into memory, you need to specify the model. You set the model using the "model" command. You have two basic options here. First, you can assume that the number of populations is fixed to some number. In this case, you are using a model that is commonly applied today, and which was first described by Pritchard et al. (2000). Second, you can assume that the number of populations is a random variable that follows a Dirichlet process prior. In this case, you are assuming a model first described by Pella and Masuda (2006).
The basic options can be summarized by the following examples:
This fixes the number of populations to two. You can change the number of populations by changing the numpops option; "model numpops=3", "model numpops=4", "model numpops=5", for example, respectively fix the number of populations to three, four, and five.
You can allow the number of populations to be a random variable with a Dirichlet process prior as follows:
where "rv" means "random variable". At this point, we will digress a bit and describe the Dirichlet process prior. The Dirichlet process prior is a model that is often used in Bayesian clustering. Sometimes the model is referred to as the "Chinese restaurant process", and it is perhaps most easily described using a Chinese restaurant as an example. Suppose that we have a line of patrons, perhaps Larry, Moe, and Curly, who will enter a Chinese restaurant one at a time. The restaurant has a countably infinite number of tables (it is a very large restaurant). The first patron, Larry, enters the restaurant and sits at one of the tables (this with probabilty one). Then the second patron, Moe, enters the restaurant. Moe can either sit at the same table as Larry [with probability 1/(1+alpha)] or at an unoccupied table [with probability alpha/(1+alpha)]. Let's say that Moe happened to sit at the same table as Larry. Clearly, at this point only one of the tables in the restaurant is occuppied (with two people). The next (and last) patron, Curly, now enters the restaurant. Curly can either sit with Larry and Moe [with probability 2/(2+alpha)] or at an unoccupied table [with probability alpha/(2+alpha)]. Let's say that Curly sits at a previously unoccupied table. Now that all of the patrons have entered, we see that two of the tables are occuppied. One of the tables has two patrons (Larry and Moe) and the other occupied table has only one patron (Curly). Note that the Dirichlet process prior has only a single parameter, alpha, that controls the tendency of patrons to sit together. If alpha is small, you tend to end up with few tables occupied. If, on the other hand, alpha is large, patrons tend to sit at unoccupied tables. When a patron enters the restaurant, he/she will sit at a particular occupied table with probability (number of current occupants) / (normalizing constant). Hence, tables with lots of occupants tend to attract others to sit at the same table. Under the process described here, one can calculate all sorts of interesting probabilities. For example, one can calculate the probability of a certain configuration of patrons at tables (and importantly, the order in which patrons enter the restaurant does not change this probability), the probability that some number of tables are occupied, and the probability that two individuals find themselves sitting at the same table. In the program Structurama, "tables" are equivalent to populations and the individuals in the analysis are equivalent to the patrons entering the restaurant.
Now that you have a very basic description of the Dirichlet process, you can see that the parameter alpha is a very important one. In Structurama, you have two choices in how you deal with the alpha parameter. First, you can fix alpha to some value. All you have to do is specify the prior mean of the number of tables. The program then calculates the value of alpha that gives the desired prior mean. You can implement this option using the following command:
model numpops=rv concparmprior=fixed_expk(5)
which sets the prior mean of the number of populations to five. You can also set the value of alpha directly, using "concparmprior=fixed(alpha_value)". The other option you have is to let alpha itself be a random variables. If you choose this option, then you have to specify a probability distribution (called a hyperprior) for alpha. In Structurama, you can allow alpha to be a random variable with a gamma probability distribution. The gamma distribution has two parameters, called the shape and scale parameters. So, if you want to treat alpha as a random variable then you have to choose a value for the shape and scale parameters. (You cannot let the shape and scale parameters have probability distributions.) The following command allows alpha to have a gamma prior probability distribution:
model numpops=rv concparmprior=gamma(1,1)
We offer no advice for setting the shape and scale parameters. However, the mean of the gamma distribution is shape/scale and the variance is shape/scale^2.
New to version 2.0 of the program is the ability to treat the alleles of an individual as being an admixture from several populations. Admixture is specified as follows:
With admixture and when treating the number of populations as a random variable, the model is a Hierarchical Dirichlet Process prior model (AKA the "Chinese Restaurant Franchise Process). When the number of populations is fixed, then Structurama uses the same prior model as the program Structure.
5. Run the Markov chain
Structurama uses a numerical technique called Markov chain Monte Carlo (or MCMC) to calculate the probability of assigning individuals to populations. One of the most basic ideas in Structurama is that assignments of individuals to populations can be thought of as partitions. Going back to our Larry, Moe, and Curly example, there are a total of five different ways that Larry, Moe, and Curly can be partitioned among populations:
(The number of possible partitions for n individuals is given by the Bell numbers.) Structurama labels partitions of individuals among populations using the "restricted growth function" notation of partitions; the first individual is always given the population index 1. If the second individual is assigned to the same population as the first, then it is also given the index 1. Otherwise, the second individual is given the index 2. This labeling scheme continues for the remaining individuals. Structurama does one of two things: Either you have fixed the number of populations (in this example involving three individuals, you could fix the number of populations to 1, 2, or 3 ) or you have allowed the number of populations to be a random variable. For the example with three individuals, if you fixed the number of populations to, say, two, then structurama would attempt to calculate the probability of the following three partitions: (1,2,2), (1,2,1), (1,1,2). Note that when you fix the number of populations to two, that you have effectively assigned zero prior probability to the partitions involving one population (1,1,1) and three populations (1,2,3). If in our example with three individuals you allow the number of populations to be a random variable with a Dirichlet process prior, then all five possible partitions have some prior probability, and the program attempts to calculate the probability of all five.
MCMC samples partitions in proportion to their probability. Structurama uses a particulary efficient variant of MCMC called Gibbs sampling. Each MCMC cycle involves a Gibbs scan of all the individuals. Hence, if you run the Markov chain for a total of 1000 cycles and you have 14 individuals in your data file, Structurama will have performed a total of 14,000 Gibbs samples. Every so often, the program outputs the current partition to a file. This file, by default, has the extension ".p". You can control how often the program outputs the current state of the Markov chain to the file using the "samplefreq" option. For example, the following command
mcmc ngen=10000 samplefreq=100 printfreq=25
specifies that the Markov chain should be run for 10,000 cycles, that the state of the chain should be output to a file every 100th cycle, and that some indication of the run status should be printed to the screen every 25th cycle.
You can also specify the name of the file containing the MCMC samples using the "outfile" option.
6. Summarize results
You can summarize the results of a MCMC analysis of population assignment using one of several commands: showmeanpart, shownumpops, showtogetherness.
The fraction of the time the MCMC algorithm samples a particular partition of individuals among populations is a valid approximation of the posterior probability of that partition. Summarizing the results of a MCMC analysis of population assignment is not trivial. Structurama summarizes results of a MCMC analysis using a couple of different methods. One thing the program provides is a "mean" partition (showmeanpart). This is the partition that minimizes the squared distance to all of the sampled partitions. We use a distance on partitions originally described by Gusfield (2004). The program will also report the posterior probability of the number of populations (shownumpops). Finally, the program will calculate the posterior probability that two individuals are in the same population (showtogetherness).