British-born Richard “Dick” Lewin Wife followed a traditional educational path, receiving his first degree in chemistry from the University of Leeds in 1969 and staying on to do a PhD in organic chemistry with David W. Jones. Research fellowships then took him to London, New York, and finally California, after which he returned to a job in the UK with Shell in 1976, moving to The Netherlands with the company in 1979. He stayed with Shell until 1987, at which point he founded SPECS and BioSPECS BV in The Netherlands. In 2005, he co-founded a new company, SORD, which is an active partner of ACD/Labs, former host of Reactive Reports.
You started out as an organic chemist, working for Shell. What made you switch to chemical IT?
At Shell, I worked in “Corporate Research” (1976–1983) and in the end transferred to an applied department (1983–1987) before leaving to start the company Specs. During my time at Shell I was amazed at how difficult it was to search for known chemistry (compounds and reactions). In those days, Beilstein was just a lot of books with some crazy and complicated indexing system, while Chemical Abstracts searches were done by experts who understood how to extract information from the CA database. No surprise that most chemists never bothered to check the literature for what had already been done! There was no Internet and no e-mail as we now know them. These were the “dark days” and chemists had been particularly bad at making their information accessible. It was a misery. But I did not jump from Shell into chemical IT. I started Specs and the business of supplying industry with compounds from academic research; the chemical IT would come later. And, by the way, I am not the IT genius you need for SORD! We are very fortunate to have two really excellent IT specialists who are also very good chemists.
How would you compare developing an organic synthesis and writing a computer program?
Since I have never written a computer program, this is not an easy question to answer. What I do know is that developing an organic synthesis is as much an art as a science. Of course you need knowledge, gained from a good basis of chemical information, but on top of that you need some feel for the molecules, for the functional groups, and for which of them will react the way you hope and which will remain “silent”. Chemistry is full of surprises and wit, and you cannot help but appreciate this. If you don’t, you soon become very depressed. Chemistry is a sophisticated game, a challenge, that always produces a result but not necessarily the one you had hoped for. What is important is communicating the results, good or bad, to the rest of the scientific community, and this is something that chemists are especially bad at. Even the surprises are written up as being what the author expected to happen. And the failures are never documented.
In what ways did working at Specs provide a grounding for joining the SORD management team?
Way back in the mid-nineties, I started calculating how much had been done (how many reactions/compounds) and comparing the results to what was accessible in the open literature. Alongside this, we looked at the compounds we were buying from academic sources and checked these against the public databases. No more than 10% of the compounds we were buying were in these databases! That’s when I realized how massive the problem was: so much undocumented chemistry and such a chaotic mess (the commercial databases). At Specs we never had the resources to do anything about this, but that is where the idea was born: someone, somewhere had to clear up the mess!
When did SORD begin looking for lost chemical reactions?
I first coined the term “lost chemistry” in the mid-nineties, mainly as a means to make the world wake up and appreciate how much chemistry has been done but not reported in a way that other people could access, but also to point out how much chemistry there still was to be done. We had some algorithms that gave insight into the missing compound structures in chemical space. It painted a depressing picture: 150 years of organic chemistry had produced so little compared to what was actually feasible. And since 80% of what had been done was not accessible, the picture was even more depressing. Sadly, this was a message that no one wanted to hear. We had discovered lost chemistry and no one was that interested. Then we started to access it, and that is when people started to get interested. We were not just accessing it but presenting it in a way that people could data-mine! We were the first company to do this.
What problems does the company face in tracking down reactions?
None! We have made a very simple but also very appealing commitment to the academic community that supplies us with its lost chemistry (from theses and dissertations). If they supply the information, we give them free access to the SOR Database! It’s as simple as that. Over the years, the relations between universities and publishers of chemical databases have somehow deteriorated. We are investing in this relationship and have been able to get the universities’ full cooperation in terms of access to the chemical knowledge that they control. Tracking down the reactions is not a problem.
Are “published” reactions included? What about patented/commercial reactions? How do you avoid duplication?
There will be published reactions in the SOR Database; it is not our business to collect only the reactions that have not been published. But think about the proportions. Around 80% of the chemistry in theses and dissertations has not been published. We select the theses and dissertations that are interesting to the pharmaceutical companies, and these are processed and entered in the SOR Database. Working on averages, this means that 20% of our data will have been published in some shape or form. Given that Notes, Communications or Letters do not contain sufficient synthetic detail for an easy reproduction of a synthesis (not to mention publications in languages other than English), the duplication is going to be less than 20%. But the most important distinction is that, of all the commercial databases, only the SOR Database is complete (containing everything from the experimental procedure) and only the SOR Database is formatted to allow data-mining. A small amount of duplication is not a problem since these data are not data-mineable in the commercial databases. Patented and commercial reactions will enter our database. If this is a problem and there is just cause for this information to be removed, we shall do so.
What do you consider the main aims of SORD and how do you help the company to achieve those aims?
Besides making the lost chemistry accessible to the scientific world, our clear goal is quality. Most databases are victims of trying to create size while neglecting quality. Those days are long gone; you must deliver quality now. This means reaction data that is validated (at source), accurately processed into electronic format, and that can be data-mined. Another main aim is to do this in a way that the academic community supports, mainly through the strategy we adopted of making the information freely available to the people who supply us with the data. So it revolves around quality and Open Access. This is what the whole company believes in. We have seen how others have cut corners or simply ignored issues such as integrity. We have chosen to develop SORD along the lines I have described. It means extra work and extra costs but the rewards make this entirely worthwhile.
Is there another comparable system?
How might other information sources containing latent information be tapped?
We do not intend to extract data from other systems. They simply do not contain the detail of the validations that we demand for the quality standard we have chosen. We extract from primary data sources; these we can trust and these we can process.
The SORD website states that there have been 50 million reactions carried out in academia. How does that figure square with CAS claims of 30 million+ entries in its database?
Chemical Abstracts is very mysterious about how many characterized compounds there are in the database and how many are actually organic compounds. It is a mess. We did an analysis of the chemical compounds in Beilstein and came out with fewer than 500,000 that you could begin to call “pharmaceutically relevant”. The 50 million reactions/compounds we refer to are derived from various analyses and these are now accepted by the people who did the work, the academic community. CAS can continue to claim what they want. If they want to make their data mineable it is up to them.
How does SORD ensure each and every reaction is validated? Presumably many reactions in theses and dissertations, etc., are never reproduced or tested again.
Validation is an important issue. We believe that the data contained in theses and dissertations are much better validated than the data from publications. Referees have very little time to check a manuscript. The validation procedure for data in a thesis or dissertation lasts more than three years (internal validation) and then comes the external examination. We are not going to repeat every reaction before we accept it and no one expects us to do this. One hundred percent certainty is not the issue. Making the best of what is available is. And what had been done until SORD came on the scene is lamentable.
Why is the SOR Database the “new standard”? Is that a standard of format adopted by others?
What we mean by the “new standard” is the issue of quality (see above). It is the highest standard and we would hope that others will migrate in this direction.
You mention Open Access on the ACD/Labs site. OA is for the academic collaborators, I presume; how far does that extend?
Yes, academics have Open Access if they contribute their data. If they choose not to, that is their decision but it also means they cannot access SORD.
How did your collaboration with ACD/Labs come about?
ACD/Labs is a company we can readily identify with; they have the same qualities that we have and they are equally visionary. There are not that many other potential collaborators, or indeed companies, in this market. Some are eminently unsuitable and I do not need to mention who they are; your readers will already know them. Only ACD/Labs shares our vision about the future and about how we (both) should develop it. The combination of chemical knowledge (content) and the sophisticated methods to data-mine this knowledge provided by ACD/Labs is very powerful indeed.
Will the database ever be “complete”, so that all previous reactions are catalogued and SORD is simply then adding new ones as they are published?
Never! We begin with the legacy of the last 40 years, and from all over the world. This is a gigantic operation. Nevertheless, we do look forward to the time when chemists will submit their very newest data to SORD. This will be a time when the current publishers wake up to the world around them and realize that they can no longer shackle their contributors as they have done in the past. Times are changing and what was a condition in the past will no longer be acceptable. But this is also not our problem; it is a problem that the academic community must solve. What we can do is set an example.
What other issues surround lost chemistry?
In a small country such as the Netherlands, more than €1 billion (about $1.3 billion) has been spent over the last fifty years on chemistry PhD students who detailed their work in their theses and dissertations. With as much as 80% of this work not being accessible because it was not published, we are looking at major spending with no way to get hold of the results! Using the technology we have developed, we could convert this lost chemistry into searchable electronic formats for less than €1 million ($1.3 million), a tiny fraction of the cash it took to produce this data. Now, extrapolate the figures for a world-wide picture! Just imagine having all of the world’s chemistry on your desktop and being able to search it, retrieve it, and use it. We would be delighted to take this task on, just as soon as people realize how valuable this lost chemistry is!