Paleantology is funded under the NSF Postdoctoral Fellowship in Biology collections track. There’s going to be a lot of data involved. That data will come from a variety of sources:
- Collections: characters and dates from the ant collections at the Field Museum.
- Previously published matrices of ant morphology.
- GenBank: nucleotide sequences will be scraped from here.
Combining these data sources raises several problems. Some morphology matrices are only published in PDF form, which makes even obtaining them difficult. There has also been considerable taxonomic revision in ants, and not all phylogenetic trees are built at the same taxonomic level. Particularly for the morphological data, the tips on a tree may only be keyed to the subfamily, tribe or genus level.
Below, I’m going to discuss some of the challenges associated with curating combined data sets.
Liberating the morphology data
I’m going to focus on three main morphological matrices: Baroni Urbani, Bolton and Ward (1992); Brady and Ward (2005); and Grimaldi, Agosti and Carpenter (1997, link goes straight to a download of a PDF). The Baroni Urbani matrix forms the basis of the other two.
I was able to digitize the latter two matrices using a fantastic tool called Tabula. Tabula finds the X and Y coordinates of tables in a PDF, and uses them to extract the tables to a flat file (I exported my matrices as CSV). The Baroni Urbani matrix, however, is saved as an image file. If you look in the paper, you might also notice the table is color-coded. That makes it really hard to use optical character recognition, because the contrast between the characters and the background is poor.
So I had to extract that by hand. I’ve never done hand-translation of a matrix before. Here are my tips:
- Print out the matrix.
- Put a sheet of paper under the line you’re working on.
- Turn off push notifications. It’s really easy to get distracted and lose your place.
- Work through the matrix in pieces (in my case, there was a page break in the middle of the table). Error check several lines from each piece before moving on to the next.
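Some of the error checking can be automated once a piece of the matrix is typed in. The sketch below is a minimal example, assuming the transcribed matrix is held as a dictionary mapping each taxon name to a string of character states. The taxon names, the set of allowed state symbols and the function name are all made up for illustration; none of them come from the Baroni Urbani paper.

```python
# Hypothetical sanity checks for a hand-transcribed character matrix.
# The allowed symbols and the taxon names are illustrative assumptions.
ALLOWED_STATES = set("0123456789?-")


def check_matrix(matrix, n_chars):
    """Return a list of human-readable problems found in the matrix."""
    problems = []
    for taxon, states in matrix.items():
        # A row of the wrong length usually means a skipped or doubled cell.
        if len(states) != n_chars:
            problems.append(
                f"{taxon}: expected {n_chars} characters, found {len(states)}"
            )
        # Anything outside the allowed symbols is probably a typo.
        bad = sorted(set(states) - ALLOWED_STATES)
        if bad:
            problems.append(f"{taxon}: unexpected symbols {bad}")
    return problems


example = {
    "Taxon_A": "0101?",
    "Taxon_B": "011",    # too short: a skipped cell
    "Taxon_C": "01O1-",  # letter O typed instead of zero
}

for problem in check_matrix(example, n_chars=5):
    print(problem)
```

This only catches mechanical slips such as a row of the wrong length or a stray symbol. A state that is valid but wrong still needs to be checked against the printout by eye.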
Taxonomic Name Resolution
This subject header is misleading, because I’m doing a couple of things: first, trying to find names that have changed, and second, developing flat-file taxonomic hierarchies for each terminal taxon that I have. In this project, we’re going to subsample the morphological data and look at how using different subsets of fossils affects the tree and divergence times we estimate. Many ant fossils are too incomplete to be keyed below the subfamily level. What I want to have is an easily parseable list of the taxa for which we have molecules and morphology.
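To make "easily parseable" concrete, here is a minimal sketch of what such a flat file could look like and how it might be queried. The column names, the yes/no flags and every taxon name are placeholders I invented for the example; they are not the layout of the actual project files.

```python
import csv
import io

# Hypothetical flat-file hierarchy: one terminal taxon per row.
# Names and the has_* columns are placeholders, not real project data.
TAXONOMY_CSV = """taxon,subfamily,genus,has_molecules,has_morphology
Ant_one,Subfamily_X,Genus_a,yes,yes
Ant_two,Subfamily_X,Genus_b,yes,no
Fossil_one,Subfamily_Y,,no,yes
"""


def taxa_with(rows, column):
    """Names of taxa whose `column` is marked 'yes'."""
    return {row["taxon"] for row in rows if row[column] == "yes"}


rows = list(csv.DictReader(io.StringIO(TAXONOMY_CSV)))
molecules = taxa_with(rows, "has_molecules")
morphology = taxa_with(rows, "has_morphology")

print("both:", sorted(molecules & morphology))
print("morphology only:", sorted(morphology - molecules))

# Fossils keyed only to subfamily have an empty genus field.
subfamily_only = [row["taxon"] for row in rows if not row["genus"]]
print("keyed to subfamily only:", subfamily_only)
```

Leaving a rank blank when a specimen cannot be keyed that far keeps fossils and extant taxa in the same table under the same rules.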
I ultimately decided to have two documents: a comma-separated values file of the taxonomy for each taxon from Moreau et al. (2006), and one for all three of my morphology datasets. In each document, there is one ant per row. For the morphology data, specimens are reused between matrices. That means in my document, the same ant may appear multiple times. I considered only representing each specimen once, but for the fossil specimens, the estimated fossil age might vary across data sources, so I’ll want to have those records separate. I don’t really want to have different rules for recording fossil and extant groupings, so I decided across the board to have one occurrence per row, even if individuals get duplicated. When I parse the data in pandas, it’s simple to group the data down to unique individuals, anyway. Given that, I feel the risk of confusion or data loss is higher if I winnow out individuals a priori. And the data set is pretty small, so data loss is a big concern.
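As a rough illustration of the grouping step, here is a small pandas sketch. It assumes a table with one row per occurrence and columns for the specimen, the source matrix and the estimated age. Those column names and all of the values are made up for the example.

```python
import io

import pandas as pd

# Hypothetical occurrence table: one row per appearance of a specimen
# in a source matrix. Names, sources and ages are made-up placeholders.
OCCURRENCES_CSV = """specimen,source,age_mya
Fossil_one,Matrix_1,90
Fossil_one,Matrix_2,100
Ant_one,Matrix_1,0
Ant_one,Matrix_2,0
Ant_one,Matrix_3,0
"""

occurrences = pd.read_csv(io.StringIO(OCCURRENCES_CSV))

# Collapse to one row per specimen, keeping the spread of reported ages
# so disagreements between sources stay visible.
per_specimen = occurrences.groupby("specimen").agg(
    n_sources=("source", "nunique"),
    min_age=("age_mya", "min"),
    max_age=("age_mya", "max"),
)
print(per_specimen)

# Specimens whose age differs between sources.
conflicting = per_specimen[per_specimen["min_age"] != per_specimen["max_age"]]
print(list(conflicting.index))
```

Because the raw table keeps every occurrence, the collapsed view can be rebuilt at any time, which is the point of not removing duplicates up front.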
The last main issue with data cleaning for this project is figuring out which morphology characters are shared across which datasets. The Baroni Urbani dataset is the source for a lot of character descriptions used in the two later papers. So I started there, and copied the character descriptions out of the Baroni Urbani paper to make a plain-text version, and then reduced that to simply the descriptions (some of the originals are quite verbose). I did the same thing for the Brady matrix.
The Grimaldi matrix was very concise, so I used it as the basis to which the other two matrices would be reconciled. What I ultimately wanted was a CSV file with a character name, the matrices in which it appears, and the states. For each character in the Grimaldi matrix, I looked at the Baroni Urbani and Brady matrices and recorded if the character was present in each, and if so, what character number it was. Ultimately, I think this is a fairly useful piece of information because I’ll be able to tell quickly where a character came from. This will be useful as I’m remixing datasets across taxonomic levels.
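A sketch of how such a character crosswalk might be read back in is below. It assumes one row per character in the reference matrix, with a column per matrix holding that character’s number there (blank if the character is absent). The column names, character names, states and numbers are all invented for illustration.

```python
import csv
import io

# Hypothetical crosswalk: one row per character in the reference matrix,
# with its character number in each matrix (blank if absent).
# Character names, states and numbers are invented for illustration.
CROSSWALK_CSV = """character,states,grimaldi,baroni_urbani,brady
Character_alpha,0;1,1,4,10
Character_beta,0;1;2,2,,7
Character_gamma,0;1,3,,
"""

SOURCES = ["grimaldi", "baroni_urbani", "brady"]


def sources_for(row):
    """Map each matrix containing the character to its character number."""
    return {name: int(row[name]) for name in SOURCES if row[name]}


crosswalk = {
    row["character"]: sources_for(row)
    for row in csv.DictReader(io.StringIO(CROSSWALK_CSV))
}

for character, found_in in crosswalk.items():
    print(character, found_in)

# Characters present in all three matrices.
shared = [
    name for name, found_in in crosswalk.items()
    if len(found_in) == len(SOURCES)
]
print("shared by all:", shared)
```

Storing the character number for each matrix, and not just a present/absent flag, makes it possible to pull the matching column straight out of each source matrix later.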
In the next few weeks, I’ll be establishing my workflow for obtaining molecular sequence data. Also, the Brady matrix has twice as many characters as the Baroni Urbani and Grimaldi matrices, and keys specimens down to the genus level. How best to make use of these data is still something I’m wrestling with.
Additionally, I’m currently working on getting good age estimates for the fossil data. I’m anticipating having enough data to do a quick-and-dirty fossilized birth-death (FBD) tree by the end of the week.
A Quick Note
I’ve shown a few documents in this post. They are very much works in progress, and a real adult might come along and tell me I’ve done something very flawed. I would not advise trying to actually use these documents for anything at this time.