Paleantology postdoc April Wright wrote a short piece for the Data Carpentry website on developing and maintaining resources for biologists to learn about programming and data management. Data Carpentry is always looking for learners, mentors and project maintainers, so head on over to find out how you can play a role in this community!
Recently, a paper came out discussing the prevalence of errors in data spreadsheets associated with publications. Many of my colleagues have taken that as evidence that Excel should not be used, in favor of programmatic solutions like R or Python. And that’s pretty much how I feel – I usually only use Excel to view data. When I do have to use Excel to enter data, such as data that cannot be obtained programmatically, I try to stick to the Data Carpentry guidelines.
That said, I think Greg Wilson made an excellent point here, that many of these same errors could occur in programmatic data analysis. It seems obvious to me that the solution here isn’t to stop using Excel, unless you’re also going to make an investment in training your workers to use programmatic tools correctly. To that end, I’d like to introduce our new undergraduate researcher, Krishna Gandikota. Krishna is working with me to develop a small parsing program to pull down taxonomy data from the community resource AntWiki, and parse it to fit into my current data structures.
This will be a useful tool for several reasons:
- We can avoid the errors introduced by entering data by hand
- We can save time over searching tons of ants by hand
- Others will be able to use this tool for their work
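To give a flavor of the kind of parsing involved, here is a minimal sketch using only Python’s standard library. The real AntWiki page structure will differ, and the class name, sample HTML, and taxon names are all illustrative – this is not Krishna’s actual code.

```python
from html.parser import HTMLParser

# Hypothetical sketch: AntWiki's real markup may differ. This just
# illustrates pulling taxon names out of list items with the
# standard-library HTML parser.
class TaxonListParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_item = False
        self.taxa = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.in_item = True

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_item = False

    def handle_data(self, data):
        # Only record text that appears inside a list item
        if self.in_item and data.strip():
            self.taxa.append(data.strip())

sample = "<ul><li>Formicinae</li><li>Myrmicinae</li></ul>"
parser = TaxonListParser()
parser.feed(sample)
print(parser.taxa)  # ['Formicinae', 'Myrmicinae']
```

In practice the page would be fetched over HTTP and the extracted names reshaped to match the project’s existing data structures.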
And here, in his own words, is what Krishna would like to accomplish this semester with the Paleantology project:
I am a sophomore in Biomedical Engineering pursuing medical school. As someone who aspires to go into the medical field, I wanted to take a step into the concept of “research”. I hope to widen my understanding of biology and obtain some new skills in software and coding.
Nantucket DevelopeR Workshop
Edit 2019: We’re teaching this again. Viewers at home can follow along here, and we’ll post a wrap-up next week.
This past August, I had the chance to head out to the UMass Field Station on Nantucket to teach a week-long phylogenetics methods development course, in R, with Klaus Schliep. The course was sponsored by Liam Revell‘s CAREER Grant (DEB 1350474).
This was a fairly unique learning environment, and I struggle with what to call it. Was it a workshop? A hackathon? Lately I’ve taken to calling it an immersive research course. We met the first day to talk about tree structures in R, collaboration with git, and to do lightning talks. On subsequent days, we met for an hour to an hour and a half in the morning to discuss some aspect of R phylogenetic methods development, including testing, profiling and writing documentation. In the afternoons and evenings, people worked on independent and group projects in R. And they worked hard, often staying at their computers until 8 or 9 PM.
I wanted to write a little about what worked and what didn’t. Intermediate-to-advanced training is something that is perennially in need of improvement. Beginning learners have excellent resources – Software Carpentry/Data Carpentry, and often training at their home institutions. The summer courses at institutions like the MBL often reinforce computational skills. But courses that go on to the next steps – developing novel scientific functions, looking at integrating together all the functions you’ve created, working collaboratively – are still fairly rare, especially in a domain-specific context where you work with people who are actually into the same things you are.
Without further ado, what worked:
- The facility. This place was amazing. I was having a conversation with another postdoc at the meeting on a vista overlooking the ocean, and the only thing I could say was ‘If this whole academic endeavor doesn’t work out, look at where we get to be right now.’ Everyone stayed in a single building at the facility (though I was traveling with my husband and daughter, and so stayed off-site), and that kept the group together, which was great.
- The mix of learners. We had a very good mix of the types of organisms and data people work with, their skill levels and the questions they want to answer. In addition to learning something about computation, I think everyone learned something cool about biology.
- Work time. Something that is really hard to get right is how much time learners should get to work on projects. We hit the right balance – people more-or-less went to work after lecture time and worked until after dinner. Klaus and I let people set their own schedules and only really interrupted for dinner.
Without further ado, what we can improve on:
- Lecture time. We made this optional, but from feedback, a little more structure would be good here. From feedback, that structure can be fairly minimal, maybe even just being a little tighter with start times.
- One suggestion that I thought was great was to do a package dissection, and look at the structure of an existing package. What a neat idea!
- Scaling. We had 10 learners. We might not change the world with 10 learners. And that’s my motivation in writing this post – we had really positive feedback on this workshop. Clearly it ought to happen more. Products from the workshop are already emerging. So I’d like to make the argument – if you sit on a pot of money of some kind, think about what you’re doing to address the intermediate training gap.
There we have it, folks. As an instructor, I had a great time working with this course. It’s something I’d like to do more of, or develop as a college course in the future. I’ve always loved working with novices, but it turns out intermediate computing education is pretty fun, too.
Preparing the Data
BEAST2 is software for co-estimating phylogenetic trees and divergence times. During this project, we’re going to be generating a lot of BEAST2 files as we add data and explore data subsampling. BEAST2 has some challenging requirements, such as that all your taxa must be present in all of your data subsets. For my data, this means all my taxa that have morphological data need to be present in the molecular partition (with question marks as data) and vice versa.
Assembling those datasets by hand can be really tricky, and most tutorials on using BEAST2 start from the assumption that you’ve already done this step. I’ve uploaded two IPython notebooks that take you through the process of crunching and combining data. The first covers assembling all the data into one BEAST2 file. The second covers subsampling the fossil data and making BEAST2 files from a subset of the data. IPython notebooks don’t render well on GitHub, so you’ll have to download the repository to play with them.
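The core padding step – making sure every taxon appears in every partition, with question marks filling in the missing data – can be sketched like this. The function and variable names are mine for illustration, not from the actual notebooks, and real alignments would of course be read from files rather than typed in.

```python
# Sketch of the padding idea, assuming each partition is a dict mapping
# taxon name -> character string. Names are illustrative only.
def pad_partitions(molecular, morphological):
    """Ensure every taxon appears in both partitions, filling missing
    data with '?' so the combined BEAST2 alignment is complete."""
    all_taxa = set(molecular) | set(morphological)
    mol_len = len(next(iter(molecular.values())))
    morph_len = len(next(iter(morphological.values())))
    padded_mol = {t: molecular.get(t, "?" * mol_len) for t in all_taxa}
    padded_morph = {t: morphological.get(t, "?" * morph_len) for t in all_taxa}
    return padded_mol, padded_morph

mol = {"Formica": "ACGT", "Lasius": "ACGA"}          # molecular partition
morph = {"Formica": "0101", "Sphecomyrma": "0111"}   # morphological partition
pmol, pmorph = pad_partitions(mol, morph)
print(pmol["Sphecomyrma"])   # '????'  (fossil taxon, no sequence data)
print(pmorph["Lasius"])      # '????'  (extant taxon, no morphology scored)
```

The same idea extends to any number of partitions: take the union of taxa, then fill each partition out to that union.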
Hopefully these notebooks are helpful for those of you who do combined molecular-morphological analyses in BEAST2. If you’re interested in how I organized the data to be able to do this data crunching so quickly, see here.
Paleantology is funded under the NSF Postdoctoral Fellowship in Biology collections track. There’s going to be a lot of data involved. That data will come from a variety of sources:
- Collections: Characters and dates from the ant collections at the Field Museum
- Previously-published matrices of ant morphology.
- GenBank: nucleotide sequences will be scraped from here.
Between these data sources, there are a lot of challenges. Some morphology matrices are only published in PDF form, which makes even obtaining them a challenge. There has also been considerable taxonomic revision in ants, and not all phylogenetic trees are built at the same level. Particularly for the morphological data, the tips on a tree may only be keyed to the subfamily, tribe or genus level.
Below, I’m going to discuss some of the challenges associated with curating combined data sets.
Liberating the morphology data
I’m going to focus on three main morphological matrices: Baroni Urbani, Bolton and Ward (1992), Brady and Ward (2005) and Grimaldi, Agosti and Carpenter (1997, link goes straight to a download of a PDF). The Baroni Urbani matrix forms the basis of the other two.
I was able to digitize the latter two matrices using a fantastic tool called Tabula. Tabula finds the X and Y coordinates of tables in a PDF, and uses these to extract them to a flat file (I exported my matrices as CSVs). The Baroni Urbani matrix, however, is saved as an image file. If you look in the paper, you might also notice the table is color-coded. That makes it really hard to use optical character recognition, because the contrast between the characters and the background is poor.
So I had to extract that by hand. I’ve never done hand-translation of a matrix before. Here are my tips:
- Print out the matrix.
- Put a sheet of paper under the line you’re working on.
- Turn off push notifications. It’s really easy to get distracted and lose your place.
- Work through the matrix in pieces (in my case, there was a page break in the middle of the table). Error check several lines from each piece before moving on to the next.
Taxonomic Name Resolution
This subject header is misleading, because I’m doing a couple of things: firstly, trying to find names that have changed, and secondly, developing flat-file taxonomic hierarchies for each terminal taxon that I have. In this project, we’re going to subsample the morphological data and look at how using different subsets of fossils affects the tree and divergence times we estimate. Many ant fossils can only be keyed to a subfamily, due to incompleteness. What I want is an easily parseable list of which taxa we have molecules and morphology for.
I decided ultimately to have two different documents, a comma-separated values file of the taxonomy for each taxon from Moreau et al. (2006) and one for all three of my morphology datasets. In each document, there is one ant per row. For the morphology data, specimens are reused between matrices. That means in my document, the same ant may appear multiple times. I considered only representing each specimen once, but for the fossil specimens, the estimated fossil age might vary across data sources, so I’ll want to have those records separate. I don’t really want to have different rules for recording fossil and extant groupings, so I decided across the board to have one occurrence per row, even if individuals get duplicated. When I parse the data in Pandas, it’s simple to group the data down to unique individuals, anyway. In that light, I feel the propensity for confusion or data loss is higher with a priori winnowing out individuals. And the data set is pretty small, so data loss is a big concern.
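Here is a toy version of that one-occurrence-per-row layout and the Pandas grouping step. The column names, taxa, and ages are made up for the example; the real files are considerably larger.

```python
import pandas as pd

# Toy one-occurrence-per-row table: the same specimen can appear once
# per source matrix, each with its own fossil age estimate.
# All values here are illustrative, not real project data.
records = pd.DataFrame({
    "taxon": ["Sphecomyrma freyi", "Sphecomyrma freyi", "Formica fusca"],
    "matrix": ["BaroniUrbani1992", "Grimaldi1997", "Brady2005"],
    "fossil_age": [92.0, 90.0, None],
})

# Collapsing duplicates down to unique individuals is a one-liner...
unique_taxa = records["taxon"].unique()

# ...while the per-source age estimates stay available when needed.
ages = records.groupby("taxon")["fossil_age"].apply(list)
print(sorted(unique_taxa))
```

This is why duplicating individuals in the flat file costs little: the grouping happens at parse time, and nothing is thrown away up front.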
The last main issue with data cleaning for this project is figuring out which morphology characters are shared across which datasets. The Baroni Urbani dataset is the source for a lot of character descriptions used in the two later papers. So I started there, and copied the character descriptions out of the Baroni Urbani paper to make a plain-text version, and then reduced that to simply the descriptions (some of the originals are quite verbose). I did the same thing for the Brady matrix.
The Grimaldi matrix was very concise, so I used it as the basis to which the other two matrices would be reconciled. What I ultimately wanted was a CSV file with a character name, the matrices in which it appears, and the states. For each character in the Grimaldi matrix, I looked at the Baroni Urbani and Brady matrices and recorded if the character was present in each, and if so, what character number it was. Ultimately, I think this is a fairly useful piece of information because I’ll be able to tell quickly where a character came from. This will be useful as I’m remixing datasets across taxonomic levels.
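Writing that reconciliation table out is straightforward with the standard library’s csv module. The character names and numbers below are invented for the example; the real table has one row per Grimaldi character.

```python
import csv
import io

# Illustrative reconciliation table: for each Grimaldi character, record
# the character number it has in each source matrix (None = absent).
# Character names and numbers here are made up for the example.
char_map = [
    {"character": "petiole shape", "Grimaldi1997": 1,
     "BaroniUrbani1992": 4, "Brady2005": None},
    {"character": "antennal scape length", "Grimaldi1997": 2,
     "BaroniUrbani1992": None, "Brady2005": 7},
]

buf = io.StringIO()  # stands in for a real output file
writer = csv.DictWriter(
    buf,
    fieldnames=["character", "Grimaldi1997", "BaroniUrbani1992", "Brady2005"])
writer.writeheader()
writer.writerows(char_map)  # absent characters become empty cells
print(buf.getvalue())
```

With the table in this shape, answering “where did this character come from?” is a single row lookup, which is exactly what remixing datasets across taxonomic levels requires.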
In the next few weeks, I’ll be establishing my workflow for obtaining molecular sequence data. Also, the Brady matrix has twice as many characters as the Baroni Urbani and Grimaldi matrices, and keys specimens down to the genus level. How best to make use of these data is still something I’m wrestling with.
Additionally, I’m currently working on getting good age estimates on the fossil data. I’m anticipating having enough data to do a quick-and-dirty FBD tree by the end of the week.
A Quick Note
I’ve shown a few documents in this post. They are very much works in progress, and a real adult might come along and tell me I’ve done something very flawed. I would not advise trying to actually use these documents for anything at this time.
First blog post
Ants are an extremely diverse group of organisms, and are interacting partners for many different species. They’re important to our global ecology and human agriculture – and they’re quite cool to boot.
Our project operates at the nexus of phylogenetic methods development, empirical biology, and paleontology.
What are our goals?
Firstly, we’d like to revisit the ant tree of life. There has been a lot of really fantastic work on ant relationships, so we can stand on the shoulders of giants here. Particularly, we’re interested in applying new fossilized birth-death models that more completely incorporate morphological data to update our understanding of the timing of divergence events in the ant tree.
Secondly, ants have a fascinating fossil record. Ants live in a variety of environmental conditions, which affect their preservation. Some ant fossils are more complete than others, and some ant fossils can only be keyed to higher taxonomic levels. All of these biases in data availability can affect the way in which we model our data. We’d like to get a better handle on how best to manage these biases in the empirical data.
Lastly, the Heath lab does extensive software development. In the course of this project, we will be developing software tools for posterior predictive model assessment, as well as extending existing methods to model data with greater biological realism.
Our full proposal can be found here.