PhD – Positive, Happy, Developments

RightOrWrong1921

When wrong is right part 2

This post follows on directly from my previous discussion of my PhD going wrong. As a brief summary of the previous episode: I ran time-consuming simulations that took me around 6 months to design and another 6 months to run. The simulations failed in the end because of a bug in some of the software I was using. Therefore, I had to run them all over again! That took me one day (at least to relaunch them; the simulations are actually still running). In this post I’d like to focus on the importance of building good computing habits from the start of your PhD, whether you’re doing bioinformatics or field ecology.

Coding facilitates life. A lot. If I could only offer two tricks to remember they would be:

Writing function-based scripts: which involves isolating functions (the bits that are actually doing stuff) from scripts in order to be able to reuse/modify them easily for further/new analysis.

Using version control: which involves saving your work as you modify it and keeping a good track of its history, so that when something goes wrong you know exactly which version was the last one that worked and which one introduced the bug.

There are loads of other good tips and many excellent blogs about how to start good coding habits (for example, this one or that one) so I am not going to develop the point here.

I’ll just try to make the point by using a philosophical-historical-dodgy example that convinced me to start coding. Coding is like using a printing press vs. a pencil to write a sentence: I can write this sentence of 71 characters in approximately 16 seconds. And that is with a pencil. If I had to use a printing press, it would take me one second to set each character in the press (assuming I trained a lot) plus one second for actually pressing the sentence. So that’s 16 seconds with a pencil and 72 seconds using the printing press (4.5 times longer). If you’re not that old-school, you will use a computer to analyse your data, and what often happens is that it will take you less time to do things “by hand” (e.g. modifying column names, removing rows with NAs, etc.) than to write fancy functions. So why bother?

Well, it’s the same as with the printing press. If you just want to write the sentence once, then, sure, don’t bother. But what if you need to write it 10 times? The writing would take 160 seconds and the printing only 81 (71 seconds to set the characters once, plus one second per copy)! Also, you’re likely to make typos when copying the sentence with a pencil, but you won’t make any with the press!
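The arithmetic above can be checked with a few lines of Python. This is only a sketch using the numbers from the example (16 seconds per handwritten copy, 71 seconds to set the press once, 1 second per printed copy); the function and constant names are made up for illustration.

```python
PENCIL_SECONDS_PER_COPY = 16  # time to write the sentence once by hand
PRESS_SETUP_SECONDS = 71      # one second per character, paid only once
PRESS_SECONDS_PER_COPY = 1    # one second to press each copy


def pencil_time(copies):
    """Total seconds needed to write the sentence `copies` times by hand."""
    return PENCIL_SECONDS_PER_COPY * copies


def press_time(copies):
    """Total seconds needed to set the press once and print `copies` copies."""
    return PRESS_SETUP_SECONDS + PRESS_SECONDS_PER_COPY * copies


def break_even_copies():
    """Smallest number of copies for which the press beats the pencil."""
    copies = 1
    while press_time(copies) >= pencil_time(copies):
        copies += 1
    return copies


print(pencil_time(1), press_time(1))    # 16 72
print(pencil_time(10), press_time(10))  # 160 81
print(break_even_copies())              # 5
```

With these numbers the press already wins from the fifth copy onwards, which is the whole argument for paying the setup cost of writing a function.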

And the same applies to your computer analysis. If you’re removing columns with NAs “by hand” from a single table, it will probably take you less time than writing a function. But what if you have more tables? How can you be sure you didn’t miss any? And on the plus side, if you write function-based scripts, chances are that you already have a function from a former analysis that removes the columns with NAs.
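As a rough illustration of what such a reusable function could look like, here is a minimal Python sketch. It assumes a table is stored as a dictionary mapping column names to lists of values, with `None` standing in for NA; the function name and the example tables are hypothetical.

```python
def drop_columns_with_missing(table):
    """Return a copy of `table` without the columns that contain a missing value.

    `table` maps column names to lists of values; None marks a missing value.
    """
    return {
        name: list(values)
        for name, values in table.items()
        if all(value is not None for value in values)
    }


# Hypothetical example data: the same function is applied to every table,
# so no table can be forgotten and no column is removed by mistake.
tables = {
    "site_a": {"species": ["x", "y"], "mass": [1.2, None], "length": [3.0, 4.1]},
    "site_b": {"species": ["z"], "mass": [0.7], "length": [None]},
}
cleaned = {name: drop_columns_with_missing(t) for name, t in tables.items()}

print(sorted(cleaned["site_a"]))  # ['length', 'species']
print(sorted(cleaned["site_b"]))  # ['mass', 'species']
```

Keeping the function separate from the script that calls it is what makes it easy to reuse in the next analysis.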

To follow up on my previous post: applied to my own case, this turned out to be my salvation! Because I had spent 6 months trying to apply good bioinformatics practice, it only took me one day to relaunch the whole analysis! I just had to change the name of the version of the software that was bugged and press enter…
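To give an idea of why such a relaunch can be that cheap, here is a minimal Python sketch in which the software version is a single parameter. It is not my actual pipeline: the executable naming scheme (`mrbayes-<version>`) and the input file names are assumptions made up for the example.

```python
from pathlib import Path

SOFTWARE_VERSION = "3.2.1"  # the only line to edit when a release turns out to be buggy


def build_commands(input_files, version=SOFTWARE_VERSION):
    """Return one command (as an argument list) per simulation input file.

    The executable name "mrbayes-<version>" is a hypothetical naming scheme.
    """
    executable = f"mrbayes-{version}"
    return [[executable, str(Path(path))] for path in input_files]


# Hypothetical input files; in a real project these would be listed from disk.
commands = build_commands(["sim_001.nex", "sim_002.nex"])
for command in commands:
    print(" ".join(command))
```

Each argument list could then be handed to something like `subprocess.run`, so that relaunching everything with another version means changing one value and running the script again.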

The process of doing actual science (i.e. from coming up with an interesting idea to submitting the paper) is not a continuous, straight line. It can change drastically at every step, and it is more about trial and error than about succeeding straight off.

Author: Thomas Guillerme, guillert[at]tcd.ie, @TGuillerme

Photo credit: wikimedia commons

 

PhD – Pretty Huge Disaster

Dresden

This is a mini series of two posts about finding positive things in negative results. Science is often a trial and error process and, depending on what you’re working with, errors can be fatal. As people don’t usually share their bad experiences or negative results beyond the circle of close colleagues and friends, I thought (and hope!) that sharing my point of view as a PhD student might be useful.

If you’re about to do a PhD you will fail and if you’ve already successfully finished one, you have failed. At least a little bit… come on… are you sure? Not even a teeny tiny bit? By failure, I just mean scientific failure here, as in: you ran an experiment and the result was… a fail, no results, do it again. There are millions of ways to fail, from errors in the experimental design to clumsiness, but in this series of posts I want to emphasize the consequences of failure more than its causes. I think that it is an important thing to learn and to embrace as a young future scientist, as much as journal rejection and other annoying and common silent academic failures.

During the first two years of my PhD, I went from the idea of quickly testing some assumptions as a starting point for a bigger question to running detailed and time-consuming simulations on one specific part of these assumptions. The time spent appeared to be completely useless scientifically, because the analysis failed, leading to false negative results, and it kept me from going back to the bigger question. Or did it?

When wrong is right part 1

Since the summer of last year, I had been working on a computationally intensive project. I was running a kind of sensitivity analysis to see the effect of missing data on phylogenies that include both living and fossil species (that’s called Total Evidence; to link back to former posts, see here and here). In brief, I was simulating datasets with a known (right) result and then removing data from them to see how the results were affected. Because of my wide ignorance at the start of coding, simulations and the method I was testing, the project took way longer than planned. And all that was of course ignoring Hofstadter’s law (‘it always takes longer than you think, even when you take into account Hofstadter’s Law’).
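The general idea of removing data from a complete dataset can be sketched in a few lines of Python. This is a toy illustration and not the simulation protocol I actually used: the function name, the tiny character matrix and the use of "?" as the missing-data marker are all assumptions made for the example.

```python
import random


def remove_data(matrix, fraction, seed=0):
    """Return a copy of `matrix` with `fraction` of its cells replaced by "?".

    `matrix` is a list of rows (one per species), each a list of character states.
    """
    if not 0 <= fraction <= 1:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)  # fixed seed so that a run can be reproduced exactly
    cells = [(r, c) for r, row in enumerate(matrix) for c in range(len(row))]
    to_remove = rng.sample(cells, round(fraction * len(cells)))
    degraded = [list(row) for row in matrix]  # the complete matrix is left untouched
    for r, c in to_remove:
        degraded[r][c] = "?"
    return degraded


# Hypothetical complete matrix: 3 species scored for 4 binary characters.
complete = [["0", "1", "1", "0"], ["1", "1", "0", "0"], ["0", "0", "1", "1"]]
for fraction in (0.0, 0.25, 0.5, 0.75):
    degraded = remove_data(complete, fraction)
    missing = sum(row.count("?") for row in degraded)
    print(fraction, missing)  # number of missing cells: 0, 3, 6, then 9
```

Each degraded matrix would then be analysed in the same way as the complete one, and its result compared to the known (right) answer.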

The expected result, as for any sensitivity-like analysis, was that the more you reduce the amount of data, the harder it is to get the right results. That wasn’t what I found at all. Instead, my simulations seemed to be suggesting that whatever the amount of data, you never get the right results. Suspicious, I checked my simulations and asked for advice from competent and talented people who helped me find caveats in my project. But still, after checking and testing everything over and over again, the simulation results appeared to be the same: the amount of data doesn’t matter, the method just doesn’t work.

Even though these results were negative, they were intriguing and, if they were right, probably important because of the number of people who have been willing to try the Total Evidence method over the last three years. From that perspective, I presented my results at the Evolution 2014 conference in Raleigh. There, I got even more comments from even more people but still, the results appeared to be right. Until one person who had had a similar unexpected result suggested that I should try an older version of some of the software I was using.

It turned out that person was right: all the weirdness in the results that I had tried for months to fix, check and explain was caused by a bug in the latest update (don’t use MrBayes 3.2.2 for Total Evidence analysis; prefer version 3.2.1).

After an obvious moment of relief came an equally obvious negative feeling of having wasted my time, and the thought that I should have given up instead of continuing to dig. But a posteriori, I’m actually glad of this misadventure, and I learned two really important lessons: (1) published software is not 100% reliable, so always test its behaviour; (2) there is nothing more productive than sending your work to colleagues and experts for pre-reviewing. Even though the bug appeared to be “trivial and easy to fix”, the amount of comments I received definitely helped improve both my understanding and my standards for this project.

Author: Thomas Guillerme, guillert[at]tcd.ie, @TGuillerme

Photo credit: wikimedia commons