Siavash Mirarab, Md. Shamsuzzoha Bayzid, Bastien Boussau, and Tandy Warnow "Statistical binning improves species tree estimation in the presence of gene tree incongruence" Science 12 December 2014: 346 (6215), 1250463 [DOI:10.1126/science.1250463]
The binning code
A pipeline is used for performing the binning step. The code
that we used for binning is available here (DOI 10.13012/C52Z13FT).
Once you unzip the file, look at the README file for usage and installation guidelines.
Note that this pipeline works on *nix-like systems (including Mac) but not on Windows; however, the main
code that performs vertex coloring and compatibility checks is written in Java
and can run on Windows. If you are on Windows, you just need to write some glue scripts.
The model species trees for the 1X model condition: avian (DOI 10.13012/C56Q1V55) and mammalian (DOI 10.13012/C5BG2KWG).
(For the reduced or increased ILS model conditions, we simply multiply or divide the branch lengths, respectively, by 2 or 5.)
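The rescaling described above can be sketched as follows. This is a minimal illustration in Python, not the code used to produce the datasets; the function name scale_branch_lengths is hypothetical, and the sketch assumes a plain Newick string in which every number that follows a colon is a branch length.

```python
import re

# Assumption: a branch length is a colon followed by a non-negative decimal
# number, optionally in scientific notation. Quoted labels that contain
# colons are not handled by this sketch.
_BRANCH_LENGTH = re.compile(r":(\d+\.?\d*(?:[eE][-+]?\d+)?)")


def scale_branch_lengths(newick, factor):
    """Return the Newick string with every branch length multiplied by factor.

    Use factor > 1 to lengthen branches and factor < 1 (e.g., 0.5) to
    shorten them; dividing by 2 is the same as multiplying by 0.5.
    """
    def _scale(match):
        return ":" + repr(float(match.group(1)) * factor)

    return _BRANCH_LENGTH.sub(_scale, newick)


if __name__ == "__main__":
    tree = "((A:1.0,B:2.0):0.5,C:3.0);"
    print(scale_branch_lengths(tree, 2.0))
    print(scale_branch_lengths(tree, 0.5))
```

The topology and the taxon labels are left untouched; only the numbers after the colons change.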
Steps of the simulation are described in more detail here (DOI 10.13012/C5VD6WCZ).
The following archives contain:
1) simulated true gene trees
2) simulated sequence data (alignments in fasta format)
3) estimated gene trees and their bootstrap replicates
Each model condition has a separate directory, with a name of the form [ILS]-[gene_count]-[alignment_length].
ILS can be 1X, 0.5X, or 2X for both datasets.
The gene_count can be 250, 500, 1000, or 2000 for the avian dataset, and 200, 400, or 800 for the mammalian dataset.
The alignment_length can be 250, 500, 1000, 1500, or true for the avian dataset, and 500, 1000, or true for the mammalian dataset.
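As an illustration of the naming scheme, the sketch below splits a directory name into its three parts and checks them against the values listed above. The helper names parse_condition and is_listed are hypothetical, and the check only reflects the values stated here; it does not tell you whether a given combination is actually present in the archives.

```python
# Values copied from the description of the naming scheme above.
ILS_LEVELS = {"1X", "0.5X", "2X"}
GENE_COUNTS = {
    "avian": {250, 500, 1000, 2000},
    "mammalian": {200, 400, 800},
}
ALIGNMENT_LENGTHS = {
    "avian": {"250", "500", "1000", "1500", "true"},
    "mammalian": {"500", "1000", "true"},
}


def parse_condition(name):
    """Split '[ILS]-[gene_count]-[alignment_length]' into its three parts.

    The alignment length is kept as a string because it may be 'true'.
    """
    ils, gene_count, alignment_length = name.split("-")
    return ils, int(gene_count), alignment_length


def is_listed(name, dataset):
    """Return True if every part of the name is a value listed for dataset."""
    ils, gene_count, alignment_length = parse_condition(name)
    return (
        ils in ILS_LEVELS
        and gene_count in GENE_COUNTS[dataset]
        and alignment_length in ALIGNMENT_LENGTHS[dataset]
    )


if __name__ == "__main__":
    print(parse_condition("1X-1000-1500"))
    print(is_listed("1X-800-1000", "mammalian"))
```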
For the avian 1X and mammalian 1X model conditions, we include sequence alignments only in the 1X-1000-1500 and 1X-800-1000 directories, respectively (thus, none of the remaining directories have alignment files). Similarly, for the avian dataset, we include estimated gene trees only in the 1X-1000-[alignment_length] directories. We need to do this because of the dataset size; were we to include all alignments and gene trees, the files would be orders of magnitude larger than they are now. We can do this because the various model conditions with 1X ILS use the same set of underlying simulated gene alignments, and the model conditions with different numbers of genes use the same gene trees.
For model conditions with fewer than 1500 sites for avian or 1000 sites for mammalian, the alignments need to be trimmed to the first [alignment_length] sites. For example, to get alignments for the 1X-1000-1000 avian model condition, the alignments from the 1X-1000-1500 directory need to be trimmed to their first 1000 sites, discarding the final 500 sites. We have created two scripts called create-sub-alignments-avian.sh and create-sub-alignments-mam.sh (available at DOI 10.13012/C53X84KN) to perform this task.
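To make the trimming step concrete, here is a minimal Python sketch that keeps the first n sites of every sequence in a FASTA alignment. It is not the content of the create-sub-alignments scripts, which should be used for the actual datasets; the function name trim_fasta is hypothetical, and the sketch assumes standard FASTA text in which a sequence may be wrapped over several lines.

```python
def trim_fasta(text, n_sites):
    """Return FASTA text in which each sequence is cut to its first n_sites.

    Sequences are written back on a single line each. Lines that appear
    before the first '>' header are ignored.
    """
    records = []  # list of (header line, full sequence)
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line, []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return "".join(f"{h}\n{seq[:n_sites]}\n" for h, seq in records)


if __name__ == "__main__":
    alignment = ">taxon1\nACGTACGT\n>taxon2\nTTTTGGGG\n"
    print(trim_fasta(alignment, 6), end="")
```

Because every sequence in an alignment has the same length, cutting each one at the same position preserves the alignment columns.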
For gene trees, it is possible to create symlinks from model conditions with fewer genes to those with more genes. Unfortunately, some of the symlinks in our avian archive files are broken. We have created a script (called fixlinks.sh, available at DOI 10.13012/C53X84KN) to fix these broken symlinks (we did not want to update our archive files post publication).
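If you want to check for broken symlinks before or after running fixlinks.sh, a sketch along the following lines can list them. It is an independent illustration, not part of fixlinks.sh; the function name find_broken_symlinks is hypothetical, and the sketch only reports links without repairing them.

```python
import os


def find_broken_symlinks(root):
    """Return a sorted list of symlinks under root whose targets do not exist."""
    broken = []
    # os.walk does not follow symlinked directories by default, so each link
    # is inspected as an entry of its parent directory.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # islink() is True for any symlink; exists() follows the link
            # and is False when the target is missing.
            if os.path.islink(path) and not os.path.exists(path):
                broken.append(path)
    return sorted(broken)


if __name__ == "__main__":
    for path in find_broken_symlinks("."):
        print(path)
```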
Each record contains files for the true gene trees, gene sequences, and estimated gene trees.
To set up the entire dataset, follow these steps.
- Unarchive all the avian archive files under a single directory (e.g., avian); similarly, unarchive all the mammalian files under a single directory (e.g., mammalian). Be sure to unarchive the estimated gene tree files first and then the alignment files.
- Create and link the missing avian files:
  - Copy create-sub-alignments-avian.sh and fixlinks.sh to the avian directory.
  - Run fixlinks.sh.
  - If you also want to have alignments, run create-sub-alignments-avian.sh.
- Create the missing mammalian alignment files:
  - Copy create-sub-alignments-mam.sh to the mammalian directory.
  - Run create-sub-alignments-mam.sh.
In addition, we provide the definition of bins for all our super gene trees for our avian and mammalian datasets (DOI 10.13012/C5Z60KZF).
These files contain a pairwise/R*/[50/75]/bin.*.txt file for
each model condition. The bin files are simple text files that give
the gene ids put into each bin.
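A minimal sketch for loading the bin definitions from one directory is given below. The exact layout of a bin file is an assumption here: the sketch treats each bin.*.txt file as one bin and reads its gene ids as whitespace-separated tokens, which covers both one id per line and several ids on a line. The function name read_bins is hypothetical.

```python
from pathlib import Path


def read_bins(directory):
    """Map each bin file name in directory to the list of gene ids it contains.

    Assumption: gene ids are separated by whitespace (spaces or newlines).
    """
    bins = {}
    for path in sorted(Path(directory).glob("bin.*.txt")):
        bins[path.name] = path.read_text().split()
    return bins


if __name__ == "__main__":
    # Point this at one of the directories that holds bin.*.txt files.
    for name, gene_ids in read_bins(".").items():
        print(name, len(gene_ids))
```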
There are 5 biological datasets.