Interview with Internet Chemist Peter Murray-Rust

Interview by David Bradley

ISSUE #50
November - December 2005

Peter Murray-Rust

Peter Murray-Rust, originally a crystallographer with a DPhil from Oxford, has worked at the University of Ghana, the University of Stirling, and at Glaxo, where he developed new technologies including molecular graphics, protein structure determination, and intranets. He ran the first multimedia virtual course on the Web (Principles of Protein Structure) at Birkbeck College London and was a virtual chemist at Nottingham University. With Henry Rzepa and other colleagues, he pioneered the chemical MIME type for the Internet, which has enabled software and chemists to interact more intuitively. He is a keen supporter of XML and its chemical cousin CML and created a CML browser known as Jumbo.

Currently, he is at the University of Cambridge and is helping to establish novel software and Web technologies for chemists and other scientists underpinned by the concept of open source.

What O/S do you use on your workaday PC?

Windows XP but I have also tried Suse Linux. That doesn't yet cope fully with having to travel and present at meetings.

To what site do you have your browser homepage set?

Google; although this was not deliberate.

Aside from this interview, what are you working on today?

Catching up from ten days involving four different visits (it's Sunday). (a) To the fifth anniversary of the Chemistry Development Kit (CDK) in Koeln, Germany, where we are trying to develop interoperable OpenSource software. Also, we are continuing to develop the Blue Obelisk movement which helps to unite people and projects with this vision. (b) The science, technical and medical (STM) publishers meeting in Frankfurt where I argued that scientific data should be "Open" - i.e. any data published should not belong to the publisher. I got an unexpectedly favorable hearing. (c) Materialsgrid - an eScience project (IBM, Accelrys, Cambridge, Daresbury, and Frankfurt) for high-throughput computation of materials properties. This will have a major output for CMLComp (Chemical Markup Language for Computational chemistry) and dictionaries based on it. (d) Advances in scholarly publishing, again to argue for OpenData and again very well received.

[Figure: World Wide Molecular Matrix]

What are your long-term goals in your field?

To help create the Semantic Chemical Web where machines can understand and execute chemistry. This will happen when all publishers encode their chemistry in XML/CML and add identifiers such as InChI to the publications. But they must move away from a print mentality to publishing rich compound objects (documents + data). There are signs that this might happen. If so, it will completely transform the practice of chemical information and mean that chemistry could lead the world once again.
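
As a rough illustration of what "chemistry encoded in XML/CML with an InChI identifier" could look like to a machine reader, the sketch below parses a small CML-style fragment with Python's standard library. The fragment itself, the `iupac:inchi` convention label, and the helper name `read_molecule` are assumptions made for this example; it is not taken from Jumbo or from any publisher's actual markup.

```python
import xml.etree.ElementTree as ET

# Namespace commonly used for CML documents (assumed here for illustration).
CML_NS = "http://www.xml-cml.org/schema"

# A hypothetical CML-style record for methane with an embedded InChI.
CML_DOC = """<molecule xmlns="http://www.xml-cml.org/schema" id="m1" title="methane">
  <identifier convention="iupac:inchi" value="InChI=1S/CH4/h1H4"/>
  <atomArray>
    <atom id="a1" elementType="C"/>
    <atom id="a2" elementType="H"/>
    <atom id="a3" elementType="H"/>
    <atom id="a4" elementType="H"/>
    <atom id="a5" elementType="H"/>
  </atomArray>
  <bondArray>
    <bond atomRefs2="a1 a2" order="1"/>
    <bond atomRefs2="a1 a3" order="1"/>
    <bond atomRefs2="a1 a4" order="1"/>
    <bond atomRefs2="a1 a5" order="1"/>
  </bondArray>
</molecule>"""


def read_molecule(cml_text):
    """Extract title, InChI, element symbols, and bonded atom pairs."""
    ns = {"cml": CML_NS}
    root = ET.fromstring(cml_text)
    identifier = root.find("cml:identifier", ns)
    atoms = [a.get("elementType")
             for a in root.findall("cml:atomArray/cml:atom", ns)]
    bonds = [tuple(b.get("atomRefs2").split())
             for b in root.findall("cml:bondArray/cml:bond", ns)]
    return {
        "title": root.get("title"),
        "inchi": identifier.get("value") if identifier is not None else None,
        "atoms": atoms,
        "bonds": bonds,
    }


if __name__ == "__main__":
    print(read_molecule(CML_DOC))
```

Because the atoms, bonds, and identifier are explicit elements rather than a printed picture, a program can read the same record a human reader would see rendered.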

What is the single most essential piece of software for chemists?

A semantically rich tool for creating and reading chemical information. It should be mandated by chemical publishers and freely supplied to all authors and readers (both of whom can be machines as well as humans).

What makes it essential?

That almost all current chemical software is designed for use by sighted humans and has no formal semantics. This makes it difficult to agree how to exchange chemical information without loss and corruption. The publication process is our biggest chance to revolutionize chemical information.

What uptake are you seeing of CML by the chemical and pharmaceutical industries as well as software vendors?

Uptake is strongest in fields such as bioscience and materials, among librarians, and in chemical publishing. The pharmaceutical industries are secretive, so it's difficult to say what they do in-house. However, there is still little evidence that they want to exchange semantically rich chemical information. The same is true of software vendors, who aim their products at pharma, not academia. However, there is increasing pressure from regulators, etc. to use XML and therefore CML.

What do you see as the long-term impact of InChI?

Enormous. I hope that when every publisher accepts and encourages the use of InChI in primary publications then Google can act as the primary chemical search engine. The need for secondary abstracters will change and they will need to adapt.

Will we ever see a near-perfect search engine for chemists?

It's important to realize that our current vision of "search" is unfortunately narrowed by the concentration on organic molecules of interest to pharma. I think InChI + Google is near-perfect for exact chemical searches for 98%+ of organic molecules where the connection table is a very good platonic representation.
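
The reason a general text search engine can serve for exact structure search is that an InChI is a canonical string: the same connection table always yields the same text, so structure lookup reduces to string lookup. The sketch below shows that idea with a tiny in-memory index. The document texts, the identifiers `paper-1` to `paper-3`, and the function names are hypothetical, and this says nothing about how Google actually indexes pages.

```python
import re
from collections import defaultdict

# Assumes each InChI in the text is delimited by whitespace.
INCHI_PATTERN = re.compile(r"InChI=\S+")

ETHANOL = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
BENZENE = "InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H"

# Hypothetical publications that embed InChI strings in their text.
DOCS = {
    "paper-1": "Ethanol " + ETHANOL + " was distilled twice before use.",
    "paper-2": "Benzene " + BENZENE + " served as the solvent.",
    "paper-3": "The product was recrystallised from " + ETHANOL + " overnight.",
}


def build_index(documents):
    """Map each InChI string found in the texts to the documents citing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for inchi in INCHI_PATTERN.findall(text):
            index[inchi].add(doc_id)
    return index


def exact_search(index, inchi):
    """Return the sorted ids of documents containing exactly this InChI."""
    return sorted(index.get(inchi, set()))


if __name__ == "__main__":
    idx = build_index(DOCS)
    print(exact_search(idx, ETHANOL))
```

Plain string matching of this kind only covers exact matches; it does nothing for substructure or similarity queries.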

What about sub-structure searching?

I am hoping to develop an approach to substructure-searching using Google - there are several possibilities. "Substructure" depends on one's point of view - searching for ring nuclei is fairly easy using a modified InChI. For compounds like "glucose" (open-chain vs. ring) this will need to be encapsulated in chemical ontologies, hopefully fuelled by machine-learning. I believe that XML will open the search possibilities for finding: "a reaction with an equilibrium constant of...", "a ribose ring with a conformation not hitherto observed," "a compound which has 99% of the same electron density as my search compound", "a functional group which is likely to be resistant to reagent X", and so on...

What do you foresee happening in the arena of Open Source development and online access to chemical databases in the next few years?

I see databases being replaced by open publication into the Web environment and this is already happening in bio- and other sciences. In principle, this can happen in chemistry if the attitude is right. There is always going to be a formal need for reference such as patent data but for current awareness Google is likely to be more valuable than conventional human-created databases. Open source acceptance is purely cultural. Most sciences welcome it. In the library world and bioscience it is assumed that all new developments are open source. But in chemistry there are entrenched attitudes such as, "If it's free it can't be good," "Chemists shouldn't be writing software - they should be doing research," "Chemical software isn't worthy of publication," "We are not interested in informatics, that's up to suppliers," and so on.

How can chemists in academia and industry reconcile the concept of Open Source with commercial reality?

I think the main virtue of open source is that it allows and promotes innovation. At present we have stagnation in chemical software as there is no open discussion of the need for development. Moreover, academia is an unimportant market; the pharma industry is much larger, so there is no channel for academic ideas to promote innovation. Also, chemical culture is often that software "does a job", rather like an instrument, not that it is a creative instrument. I think the pressures come from outside. Bioscientists are tired of the closed attitudes in chemistry - no access to source or data. So they are starting to do their own thing for the chemistry they need. Witness PubChem. I hope the same happens for software. Maybe our own efforts will resonate with what they need and will get incorporated into the communal effort. I am an optimist and hope that industry will change its approach. Much of what they do is integration of existing processes with closed methods and closed source.

Why is that a problem?

It prevents interdisciplinarity and leaves their information and creative thinking dictated by the software and informatics providers. And these in turn are primarily motivated by finding the areas where there is most return for effort - nothing wrong per se, but it often fails to innovate. I see some small flickers of change but I suspect it has to come from outside. However, when the current processes and data crumble, as I expect them to in the face of the semantic revolution, there will be exciting opportunities for Open source and Open data.