Reflecting on the aims of an ambitious plan and the thorny challenges therein.
Writer: Joshua Williams
Editor: Altay Shaw
Artist: Sophie North
A 2016 Kew report estimated the number of known living plant species to be just under four hundred thousand. Since then, one branch of Kew's Plant and Fungal Tree of Life project (PAFTOL) has sought to understand the evolutionary relationships in this incredibly diverse group. The project envisions a powerful collaborative resource, one that is flexible, reproducible, and contains high-quality, up-to-date information.
This inspiring goal was set out in a 2018 roadmap publication in the American Journal of Botany, which identified both the key requirements, in a set of calls to action, and the potential challenges. Such obstacles not only face this project but also reflect the broader difficulties of bringing large numbers of people and organisations together with a common goal. In this way, the many considerations here serve as food for thought for a wide range of endeavours, scientific or not.
Firstly, the project must cater to a number of varied use-cases depending on the experience and role of the user. The data should be accessible both to non-specialist users, such as educators who might require just a basic phylogenetic tree, and to specialist researchers, for whom the resource should integrate external tools for detailed tree modification and customisation.
The project also needs to be accessible – if the data are unavailable to certain users (for example, because of a steep paywall) then the project will struggle to take hold globally. Furthermore, opening the data to all will encourage greater education and thus promote interest, creating a positive feedback loop to support a new wave of research, which is especially crucial for current global challenges like food security. This accessibility extends to use of the resource. Navigation should be intuitive, and technical details clearly annotated for non-specialists. Adequate guidance and help for newcomers to the resource should be easy to find. Building a large community would also encourage the sharing of techniques and ideas to further improve the resource.
Principal amongst the challenges is the continual upkeep of such a database, since it would be a tragic loss for such concerted effort to go to waste as soon as the project’s funding ended. Project leaders suggest the need for constant funding from institutions worldwide to ensure the effective upkeep of the resource following its creation, something undoubtedly difficult to secure. The project is proposed for the greater good of research, rising above competitive and monetary interests, but without these interests there may be a lack of motivation for funding bodies to support the efforts.
Additionally, the project needs either to resolve or to make clear conflicts in the literature, as well as to show where knowledge is lacking. Along with this, it will be necessary to dedicate resources to moderating and validating phylogenetic trees, computationally and/or manually, since errors or missing data could have major consequences for further research.
This leads to the issue of data continuity. Not only do different geographic regions, institutions and even labs store data in different formats, but different genomic markers are also often used in phylogenetic studies. Since reconstruction of phylogenetic trees relies on comparison of the same genetic markers between species, a lack of prior consensus on these markers means many of the previously studied species will have to be re-analysed. Whilst this work is not entirely redundant, it will surely slow the project’s progress.
A final challenge for the project is the need to design new algorithms for phylogenetic reconstruction. Much of the software currently used cannot accurately handle the hundreds of thousands of species necessary for this project. More importantly, if these algorithms disagree with existing studies about the placement of individual species or taxa, there must be processes to resolve these disagreements whilst retaining the integrity of the algorithm as a whole.
In February 2021, the first data release of the project was published, reporting a phylogenomic dataset of 3099 angiosperm samples, the most extensive of any such database to date, with members from 96% of known families and from all 64 orders. This work is presented in an open data portal named the Kew Tree of Life Explorer and serves as an impressive precedent for future work to follow.
As an overview, the project lays out a pipeline of steps which involve gathering data online, studying it for phylogenetic reconstruction, storing it securely and efficiently, and then presenting it for public use. Each stage of the process presents a huge and challenging task, but I hope that the project finds success, as this sort of centralisation will be invaluable for efficient progress in a broad range of plant sciences research, as well as paving the way for similar projects in other phyla.