The Connector of Open Science: A Talk With Antony Williams of ChemSpider

Before we begin, Dr. Williams, I would like to give readers a bit of background on why those of us who are not chemists should know who you are and why ChemSpider is important.

I came to learn of you as I have been trying, as the saying goes, to “wrap my head around” the concepts of Open Science, Open Notebook Science and what seems to be a genuine revolution in scientific communication and the dissemination of scientific information. I have come in the past year or so to learn that there is a core group of people in the vanguard of this movement including Jean-Claude Bradley, Cameron Neylon, Andrew Lang, Michael Nielsen and you.

Open Science seems to me to be a genuinely transformative movement in not only science but in the information sciences and should be of interest to anyone with an interest in scientific research, scientific publishing, academia, scholarly communication, search engines, the relationship of new technologies to mainstream scientific societies, technological innovation, the Semantic Web, big data, social networking in the sciences, new tools for scientific and scholarly collaboration and so on. And throw in the fact that you have managed to create something important with very little (if any) grant funding that caught the attention of and was acquired by the prestigious Royal Society of Chemistry.

We actually did it without any grant funding. ChemSpider was originally started as a hobby project so looking for grant-funding to fund a hobby project per se wouldn’t have been a good basis for any agency to contribute. Can you imagine that exchange with the NSF… “It’s something we’re interested in doing for the benefit of the community so would you mind kicking in a few dollars to cover the costs at least?” A discussion destined for failure. Even much later when we were listed on grant applications with collaborators we unfortunately didn’t get any funding. We did, however, have a number of companies step forward and sponsor the site because we were doing something of value for them, especially when we provided access to the data via a series of web services. Specifically Waters, Thermo, Agilent and Bruker all contributed and we thank them for that. It helped pay for the hardware!

Thus, we have in you the story of a basic scientist who also succeeded as a Web entrepreneur on the strength of a way cool tool and the respect you garnered as a scientist and innovator.

Thanks for the kudos. While I am a scientist I think my role over the past decade has been “Idea Guy” and “Connector,” connector in the Malcolm Gladwell sense of the word (The Tipping Point). I know a lot of very able people who are willing to move from the talking stage of doing something to actually getting it done. I’m very opinionated regarding when there is enough talking and thinking about doing something and it’s time to get things done. As I like to say “Try to feed the dog and the dog dies. If you ACTUALLY feed the dog he’ll be fine….but if you only TRY he won’t be.” Over the years I’ve developed a fairly good batting average in terms of starting a project, iterating, revisiting progress, stopping if necessary and redirecting as necessary. It’s my preferred way of working. It’s much easier to run a project this way outside of a structured organization, of course.

I envision as readers of this article all those who are, like me, trying to get a grasp of who’s who in Open Science, anyone who cares about the future of science (and that really should include every human being given the role science plays in practically every breath we take, how long we live and like matters), plus medical, scientific and academic librarians, science educators at all levels, information scientists, computer science students with start-up stars in their eyes, those who love chemistry, those who need to keep up on key developments in the world of search and those who want to gain a better understanding of rather arcane terms such as “the Semantic Web.”

I agree with you…Science touches us all. I was listening today to President Obama commenting about how the USA is lagging behind on math and science skills. I have to agree. I’m British by birth and didn’t get to the States until I was in my early 20s but I’ve been here almost a quarter of a century now and I have sensed a change in the interest in science. We do amazing science in this country, just incredible. I get to meet a lot of kids but I can’t think of one who thinks it’s “cool and interesting” that I am a scientist. Maybe it’s because I don’t mix chemicals anymore and don’t make “stuff.” Science isn’t exposed to the public the way it used to be in my opinion. I think there are a lot of people trying to interest students and making sincere efforts, but I am hoping that the new administration will shine more of a spotlight on the issue of lagging math and science skills and DO something to change it.

One of the pleasures of conducting interviews with brilliant people who create pathbreaking tools is that I get the opportunity to grill them on matters of terminology.

In that tradition, I would like to ask you first to give us a very simple explanation of what ChemSpider is and why non-chemists should know about it. For example, I work in the biomedical sciences and for my own interest follow research developments in the disease amyotrophic lateral sclerosis. Much of the work on that disease involves the testing of various chemicals on mouse models to see if they have any effect on preventing or curing the disease or slowing its development. Now, I would think that tools that greatly expedite the work of chemists at a very basic level would benefit clinical researchers down the road. Could you please outline for us why average people who are ill or love people who are should care about ChemSpider and the wider world of Open Science? Does chemistry really matter to those of us who aren’t chemists?

I’ll break this into pieces to answer your questions. Firstly, what is ChemSpider? At present it is primarily a large database of chemicals and related data linked out to the original sources of the data. What we’ve been working towards is having ChemSpider be a “structure-centric” resource. If you want to find information associated with a chemical structure/compound and you know either its name(s) or its chemical structure then you can search the database and find associated information and data. The data are of various forms and include lists of chemical identifiers, experimental and predicted properties, analytical data, textual descriptions including synthesis procedures and Wikipedia articles, and links to related information and data sources.

How might this be of interest to you? You likely know the drugs associated with the treatment of amyotrophic lateral sclerosis. For example, Riluzole is an approved drug and you can find it on Wikipedia. Doing a search on that name on ChemSpider will provide access to hundreds of patents, to a long list of PubMed articles, to property data, a long list of alternative identifiers and a long list of links integrating to tens of other databases. The structure is here.

In a similar way anyone interested in particular compounds or drugs and the information associated with the drug will be able to use ChemSpider as a search engine to access the information. The amount of data and number of data sources is increasing on an ongoing basis and you can consider ChemSpider as a unifying interface and aggregator of information and links.

It is also a platform for deposition and curation. Some of this will be detailed further in our discussions but any scientist can deposit their chemical structures and compound collections onto ChemSpider to share with the public. Various forms of data can be added and scientists can participate in the validation of data and links associated with chemical compounds. As they expand or assist in cleaning the data, everyone wins. The quality of the data improves and there are fewer chances of errors proliferating across the databases as data reuse expands via semantic web integrations. While ChemSpider data are not yet pure, a major challenge with over 20 million unique compounds (!), we continue to work hard on this and lots of chemists are helping out.

You asked “Why average people who are ill should care about ChemSpider and the wider world of Open Science?”. At present I would say that ChemSpider isn’t easily digestible by the public and that they’d encounter information overload in a similar way to that experienced searching the CAS Registry or PubMed. It is a system for people with experience in Chemistry but we do have intentions of delivering different “views” of the data for other groups to use – for example, students will benefit from our intention to deliver ChemSpider Education in the future. I believe that humanity as a whole should care about Openness in science as there is so much evidence from various scientific fields at this point that openness and access to data can be beneficial to analysis, generation of fresh hypotheses and international collaboration.

In terms of “Does chemistry really matter to those of us who aren’t chemists?” that parallels questions such as “Should we care about Mathematics? Who cares about art or literature?”. While there are of course many issues and side effects of chemistry that have been detrimental to the health of the planet “chemists and chemistry” will be at the forefront of healing the harm we have done. Whether the public are conscious of it or not chemistry continues to bring benefits to society in so many ways in the form of drugs, novel materials for a myriad of applications, for green applications such as battery technologies, biodegradable polymers, safer pesticides, fertilizers and so on. These examples are ones that the public would easily recognize. Chemistry is everywhere and it should matter to all of us.

Now let’s move along to questions of terminology. As I have read around on the Web in preparation for this interview, I have seen ChemSpider characterized in various ways. Could you please define for us the terms I have come across in articles about ChemSpider?

Structure Centric Community
Chemical compounds encompass a broad distribution. There are those that have been fully characterized and defined and can be represented in terms of a chemical structure diagram and in our specific case in the form of a “connection table” of atoms and bonds. There are then those chemicals that are materials with specific compositions, for example minerals, or a distributed composition, for example polymers with a distribution of molecular weights and end groups. ChemSpider presently is limited to dealing with “structures” that can be represented with a connection table. The community aspect is twofold: ChemSpider as a resource is provided for the benefit of the community but we also intend for the community to participate in the enhancement of the data quality and content.
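
To make the idea of a “connection table” concrete, here is a minimal sketch in Python. It assumes a deliberately simple layout (a list of atoms plus a list of bonds) and uses ethanol as the example; the variable and function names are hypothetical and this is not ChemSpider’s internal format.

```python
from collections import Counter

# Ethanol (CH3-CH2-OH) as a connection table: atoms plus bonds.
# Illustrative layout only; not ChemSpider's internal representation.
atoms = ["C", "C", "O", "H", "H", "H", "H", "H", "H"]

# Each bond is (index of first atom, index of second atom, bond order).
bonds = [
    (0, 1, 1), (1, 2, 1),             # C-C and C-O backbone
    (0, 3, 1), (0, 4, 1), (0, 5, 1),  # three H on the first carbon
    (1, 6, 1), (1, 7, 1),             # two H on the second carbon
    (2, 8, 1),                        # hydroxyl H
]


def molecular_formula(atom_list):
    """Return the formula in Hill order (C, then H, then the rest A-Z)."""
    counts = Counter(atom_list)
    if "C" in counts:
        order = ["C"] + (["H"] if "H" in counts else [])
        order += sorted(e for e in counts if e not in ("C", "H"))
    else:
        order = sorted(counts)
    return "".join(e + (str(counts[e]) if counts[e] > 1 else "") for e in order)


def neighbors(index, bond_list):
    """Return the indices of all atoms bonded to the atom at `index`."""
    found = []
    for first, second, _order in bond_list:
        if first == index:
            found.append(second)
        elif second == index:
            found.append(first)
    return found


print(molecular_formula(atoms))
print(neighbors(1, bonds))
```

Materials such as minerals or polymers with a distribution of molecular weights do not reduce to one fixed list of atoms and bonds like this, which is why they fall outside what a connection table can represent.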

Deposition and Curation Platform
ChemSpider is a platform where the chemistry community can deposit their own chemical structure collections and enhance the existing database by adding new data or curating existing information. They can add or curate chemical names or identifiers, add images (pictures of crystals for example), add analytical data such as NMR, MS or IR spectra, deposit textual descriptions of synthetic procedures and so on. The curation capabilities allow the quality of the database to be enhanced edit by edit and the multi-level curator pecking order allows for iterative checking and validation.

Publishing Platform
ChemSpider was extended to provide the ChemSpider Journal of Chemistry, a platform where “publications” could be deposited and enhanced with “Semantic markup,” the process whereby terms within the online publication are linked out to other resources online. In our case we focused on connecting chemical names to chemicals within ChemSpider, chemical terms to Wikipedia (e.g. reaction names) and embedding live analytical data. This is presently being extended to provide a platform for hosting synthesis procedures.
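
As a rough illustration of what “semantic markup” means in practice, the sketch below links known chemical names in a piece of plain text to a record URL. It is a toy under stated assumptions: the function name, the dictionary of names and the example.org URL are all hypothetical, and real systems use far more sophisticated name recognition than an exact-match lookup.

```python
import html
import re


def mark_up(text, name_to_url):
    """Wrap each known chemical name in `text` with a link to its record.

    `name_to_url` maps a chemical name to a (hypothetical) record URL.
    `text` is assumed to be plain text, not HTML.
    """
    if not name_to_url:
        return text
    # Try longer names first so "sodium hydride" wins over "sodium".
    names = sorted(name_to_url, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(n) for n in names) + r")\b",
        re.IGNORECASE,
    )
    lookup = {name.lower(): url for name, url in name_to_url.items()}

    def link(match):
        url = lookup[match.group(0).lower()]
        return f'<a href="{html.escape(url)}">{match.group(0)}</a>'

    return pattern.sub(link, text)


# Hypothetical record URL, for illustration only.
known = {"sodium hydride": "http://example.org/compound/2"}
print(mark_up("Add sodium hydride slowly.", known))
```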

Interactive Platform for Chemists
ChemSpider is interactive in a number of ways including 1) the ability to extend, enhance and improve the data; 2) interacting with live analytical data by using viewing tools such as spectral viewing applets; 3) using tools for the prediction of properties for structures submitted by users – these tools can be used even for compounds that are not in the ChemSpider database; 4) posting comments on any record so that the curators can comment and respond.

Chemistry Search Engine
Chemists use the internet to search for chemistry related information. They can be searching for various types of information and data including: what is the chemical structure associated with a particular chemical identifier, physical properties of the compound, analytical data, how to synthesize a specific material, where to buy a specific material and so on. ChemSpider has the ability to answer these questions and many more, though not for all chemicals of course. ChemSpider is more of a chemical search engine than a chemistry search engine at present…you would search for a particular chemical in a number of ways and then find associated data. A “chemistry” search engine would be more encompassing and not be limited to information about chemicals only. This is one reason we are moving into synthesis procedures at present and will expand further from explicit chemicals into more general chemistry in the future.

Database
ChemSpider sits on a database of diverse data associated with millions of chemical structures. The database itself is Microsoft SQL Server and what we have done is build a data model on SQL Server and populate the database with chemistry-related content.

As you can see, I am trying to figure out when something is a database and when it is a search engine. I work very happily on ScanGrants, but I still can’t figure out if it is a database or a search engine. We just use the wording, “a public service listing of grants and other funding types.” Could you please delineate for us the differences between a search engine and a database and is ChemSpider both depending on what operation the user is engaged in within it? Could you please give examples of what you would consider a database and what you would consider a search engine?

Interesting question. I think of a database as two things – the technology itself that hosts the content (in our case the database in SQL Server) and then the data model and the data populated against the data model. Clearly SQL Server itself is unlikely to be of interest to chemists unless it is holding data of interest to them. A search engine is the manner by which people discover content and relationships that are generally, but not always, housed within a database. The utility of ChemSpider is not just limited to searching a database as the platform provides access to a number of tools for the user that still have high value and do not involve performing searches or tapping into the existing data content. For example, we have a services page where a user can draw their own structure (or upload it) and predict a series of physicochemical properties for their structure. This does not depend on the structure being in the database as they are real time predictions and not data look-ups.
Examples of various databases would be eBay, Amazon and Wikipedia – they are all sitting on underlying database technologies and the user is interested in the content within. Clearly they all need to be searchable to retrieve information of interest to the user but they are not search engines per se. To me search engines are the classical internet search engines: Google, Bing and I extend it to include Mapquest/Google Earth.
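
The distinction drawn above (database technology, data model, and the search layer on top) can be sketched in a few lines. This example assumes Python’s built-in sqlite3 module as a stand-in for SQL Server, and the table names, column names and `search_by_name` function are hypothetical, not ChemSpider’s actual schema.

```python
import sqlite3

# The "technology": an in-memory SQLite database standing in for SQL Server.
conn = sqlite3.connect(":memory:")

# The "data model": a hypothetical two-table schema, one compound to many names.
conn.executescript(
    """
    CREATE TABLE compound (
        id      INTEGER PRIMARY KEY,
        formula TEXT NOT NULL
    );
    CREATE TABLE name (
        compound_id INTEGER NOT NULL REFERENCES compound(id),
        name        TEXT NOT NULL
    );
    """
)

# The "content": data populated against the data model.
conn.execute("INSERT INTO compound (id, formula) VALUES (1, 'C2H6O')")
conn.executemany(
    "INSERT INTO name (compound_id, name) VALUES (?, ?)",
    [(1, "ethanol"), (1, "ethyl alcohol")],
)


def search_by_name(connection, query):
    """The "search" layer: find compounds by any of their names, ignoring case."""
    return connection.execute(
        """
        SELECT DISTINCT c.id, c.formula
        FROM compound AS c
        JOIN name AS n ON n.compound_id = c.id
        WHERE LOWER(n.name) = LOWER(?)
        """,
        (query,),
    ).fetchall()


print(search_by_name(conn, "Ethyl Alcohol"))
```

A property prediction service, by contrast, would compute its answer from the submitted structure and never touch these tables at all.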

In your very edifying slideshow, “How Internet Resources Are Providing a Collaborative Community for Chemistry” you have an intriguing slide entitled, “Crowd-sourcing chemistry curation.” Could you please talk about crowd-sourcing in chemistry and tell us if there is something unique to chemistry that makes it particularly suitable to crowd-sourcing or are you with ChemSpider and with your colleagues such as Jean-Claude Bradley creating models of collaboration that could be adopted by other branches of science? It does seem to me that Open Science at this point is of interest primarily to chemists and physicists. Will that always be the case or is there activity in fields such as neuroscience and medicine?

In terms of crowd-sourcing in chemistry our hope is that we can garner the support of the community to populate ChemSpider with their own content so that others can benefit from their skills and interests, the so-called “wisdom of the crowd.” However, since there is already so much data and information on ChemSpider we are also hoping that the community will help us validate and curate the data that is on ChemSpider. With over 20 million chemical entities on the database, many associated with dozens of names and properties, it is not difficult to find a simple misspelling, a property without units or a mis-associated spectrum. This amounts to millions of potential errors that cannot be validated algorithmically or robotically and human eyeballs and skills need to be brought to bear.

There is nothing specific about chemistry that makes crowdsourcing more amenable. Crowdsourcing is applied to the review of movies and books, to the review of Wikipedia articles and to the production of Open Source software. Crowdsourcing as a phenomenon is new, however, and is one of the benefits of the new platforms that have found their way onto the internet and we will only see more of this in the future. Tagging of photos on Flickr is all about crowdsourcing too.

Jean-Claude Bradley, JC, is at the forefront of Open Notebook Science and is instigating projects that harness the collective skills of students to measure, publish and collectively validate experimental data. In particular this has been brought to bear recently in his Open Solubility Project where a number of individuals are measuring non-aqueous solubility experimental data and sharing the details and results of their measurements via a wiki. The data are then aggregated and served up to the community to reuse and repurpose. Since all experimental data are available via links to the original measurement data, true Open Notebook Science, erroneous results can be questioned, discussed by the community and highlighted for re-measurement or investigation. JC’s work in this particular area is unique but Open Science has been going on for many years in astronomy where large teams openly share datasets and collaborate. If you consider biology this is already going on through the sharing of massive amounts of biological assay screening data through the PubChem platform. In this way certain labs are screening particular compounds and making the data available and other laboratories are accessing the data and using it to investigate potential lead compounds. The same is true of the ToxCast work funded by the Environmental Protection Agency (EPA) where their screening data are being made available to modelers to investigate algorithm development for example. Open Science is all around us today and only continues to grow in parallel with other areas of openness – Open Source and Open Access.

In that same slideshow, you use the rather intriguing term “lost chemistry.” Please elaborate on that.

I originally heard the term “Lost Chemistry” from a gentleman called Dick Wife. Dick has been running a project for a few years to aggregate from chemistry theses synthetic reaction procedures to build a large database of chemical syntheses. For every published chemical reaction there are many more syntheses reporting the experimental conditions, yields and analytical data that never make it outside of the originating laboratory but can be captured into a thesis. Dick has been heading a project called SORD to aggregate these data into a single database and make it available to the community. If you wish to have free access to the resource then you need to be a participating lab and share your data so that it can be populated into SORD. If you wish to access the data but not be an active contributor then you need to pay to access. This helps capture a lot of “Lost Chemistry.” As an NMR spectroscopist I have the same view of the number of spectra measured in a year around the world that never get reported and simply remain confined to hard drives inside an organization. We hope that people will take advantage of the ChemSpider platform to share their data and help prevent the “loss” of chemistry. Imagine how many experiments have been run in labs around the world where the data/description/conclusions are lost in notebooks on a shelf. I am not talking about long-lost information but data generated in the past couple of years where computer capture and data management capabilities could have helped expose the information. This will change. What we need to catalyze the shift is a couple of prominent thought-leaders in our domain to lead the way.

Again in that slideshow, you state, “ChemSpider accepts public depositions, linking to websites, hosting of details etc. Accepts structures, text, spectra, images.” Could you please give examples of each and discuss the challenges of quality control? How do you handle matters of link rot, for example? How has the acquisition of ChemSpider by the RSC helped you in such matters?

I think I’ve outlined earlier the concept of depositing information onto ChemSpider. We do have people submitting their own chemicals to the database now, regular associations of chemicals with publications, association of spectra etc. We have over 2500 spectra at this point. Quality control is rather simple but is hard work for a number of curators. We have different levels of curators and master curators check the curation efforts of members of the community who are depositing and validating data. Spectra are generally checked within a few hours of being deposited, images are generally checked within a few minutes, chemical names are re-checked within a day etc. Simple comments submitted against chemical records are available for everyone to see and the thousands of historical comments made are all viewable. Bottom line, we recheck everything. We have only had a couple of acts of vandalism and it amounted to people posting funny images (for example, the Katie Crowe incident or the Exploding Mouse). I’d estimate well over 95% of the edits and depositions are correct associations. I would say that 99% of the comments raise appropriate awareness to us regarding issues on a particular record. Overall, this is very impressive.

Link rot is an issue but we try to minimize it. For example, we link as many publications as possible via PubMed ID or, preferably, digital object identifier (DOI) and then use those identifiers to pull back the associated information via the appropriate services. We are at risk of broken links with things like blog posts but we tend to consult the blog for Creative Commons licensing and respect it, often depositing a relevant blog post onto the database and creating the link out to the originating blog post. If the blog changes later and there is no redirect then we at least do have the original source info on the database to load. Link rot is something we are trying to wrap our heads around and in terms of data we foresee that application of DOIs for data, as Thieme have done recently for spectral data, could be an important approach.
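
The approach of storing a persistent identifier rather than a raw URL can be sketched as follows. This is a minimal illustration, assuming the public doi.org resolver and the PubMed record URL pattern; the function names, the record layout and the example DOI are hypothetical, and this is not a description of ChemSpider’s actual code.

```python
from urllib.parse import quote


def doi_url(doi):
    """Build a resolver link from a DOI; the resolver tracks where the article lives."""
    return "https://doi.org/" + quote(doi, safe="/")


def pubmed_url(pmid):
    """Build a link to a PubMed record from its numeric PubMed ID."""
    return f"https://pubmed.ncbi.nlm.nih.gov/{int(pmid)}/"


def best_link(record):
    """Prefer a DOI, then a PubMed ID, and fall back to a stored raw URL.

    `record` is a hypothetical dict with optional "doi", "pmid" and "url" keys.
    """
    if record.get("doi"):
        return doi_url(record["doi"])
    if record.get("pmid"):
        return pubmed_url(record["pmid"])
    return record.get("url")


# Made-up identifiers, for illustration only.
print(best_link({"doi": "10.1000/example-doi", "url": "http://example.org/old-page"}))
print(best_link({"pmid": 12345}))
print(best_link({"url": "http://blog.example.org/some-post"}))
```

Only the last case, a bare URL such as a blog post, remains exposed to link rot, which is where keeping a copy of the original content on the database helps.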

You also say, “Blogs should be searchable too.” Are blogs (which can be deleted in seconds by the owners) some of the places that lost chemistry can be found? Are Open Notebooks basically blogs? Or are they wikis? Or what?

There are some wonderful blogs out there for synthetic chemists especially. I like the blog of Paul Docherty and “Milkshake” who runs the Org Prep Daily blog. Paul’s blog is generally a detailed analysis of a particular synthesis with links to the original paper. Milkshake posts details of particular syntheses and a lot of experimental detail. We copy the contents of both to ChemSpider and link out to the original posts. I would call this Lost Chemistry….many people had never heard of these blogs until I started pointing to them. I believe we have driven traffic to them. Open Notebooks could be blogs (like OrgPrepDaily) but in the pure sense of the definition from JC Bradley the wiki format with date and time increments displayed throughout the progress of the work is more appropriate. I would say most ONS pages are wiki-based at present.

One new buzzword is “Linked Data.” Is ChemSpider an exemplar of that concept? How does linked data differ from semantic linking?

For the purists I believe that Linked Data is the exposure of data using semantic web layers such as RDF (Resource Description Framework). For myself I believe that ChemSpider is indeed an example of Linked Data even in its present form. We are not yet exposing RDF on ChemSpider but have had it on our plans to do so. There is, (un)fortunately, always something else waiting for our attention.
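
For readers unfamiliar with RDF, the sketch below shows what exposing a record as triples might look like, written out in the N-Triples text format using plain Python. The example.org URIs and the helper names are hypothetical; the two predicates (rdfs:label and owl:sameAs) are standard vocabulary terms. Since ChemSpider was not exposing RDF at the time, this is purely an illustration of the concept.

```python
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def uri(value):
    """Format a URI reference the way N-Triples expects: <...>."""
    return "<" + value + ">"


def literal(text):
    """Format a plain string literal, escaping backslashes and double quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def to_ntriples(triples):
    """Serialize (subject, predicate, object) tuples, one statement per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


# Hypothetical URIs for a compound record and the same compound elsewhere.
compound = uri("http://example.org/compound/1")
triples = [
    (compound, uri(RDFS_LABEL), literal("ethanol")),
    (compound, uri(OWL_SAME_AS), uri("http://example.org/other-db/ethanol")),
]

print(to_ntriples(triples))
```

Each line states one machine-readable fact: the first gives the record a name, the second says that a record in another database describes the same thing, which is the “linking” in Linked Data.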

In your very interesting slideshow, “Navigating the Complex Web of Chemistry Using ChemSpider,” you state on one slide, “Publishers can enhance their articles…” Could you please elaborate on that and tell us what you see as the value of Elsevier’s Article of the Future project?

The publishers are not naïve; they know that “publications” will not be paper-based for very much longer. So many of their delivery vehicles have had to be reinvented over the past few years as users have come to expect access to electronic forms of the article ahead of paper-based delivery. The American Chemical Society just this year announced the first step to migrate away from paper-based delivery and I’m not aware of any of the Open Access journals delivering a paper-based format. A number of the publishers are already active with delivering enhancements to the articles that can only come by enabling electronic forms of the article. I include the efforts of the Public Library of Science (PLoS), Nature Publishing Group (NPG) and the Royal Society of Chemistry (RSC). In Chemistry especially the RSC led the charge with their award-winning Project Prospect semantic mark-up project whereby certain terms within the article would be “marked-up” and linked to information such as chemical compound details, definitions in the IUPAC Gold Book etc. The ChemSpider team developed our own version of semantic mark-up, called ChemMantis, and linked out from chemical names, reaction names and biological entities such as proteins, enzymes, bacteria etc. to the associated Wikipedia record. We used the ChemSpider database as the navigating layer to help link from chemical compounds out to chemical vendors, analytical data and related publications. We ended up producing a “ChemSpider Journal of Chemistry” to house Open Access articles submitted to us by the community. Many of these were synthesis procedures and took advantage of our mark-up capabilities to enhance the article.

We are only going to see more efforts from the publishers in the near future to deliver enhanced articles to the community. That said, abandoning paper-based delivery will be an interesting decision for the worldwide community since the third world is still rather restricted in terms of internet speed and presently depends on paper. Of course this will, with time, change.

I believe that Elsevier’s Article of the Future could be a good representation of what electronic articles might look like in the future. I can only envision that the production of such articles will be very labor-intensive for the foreseeable future until processes are optimized and authors assume more of the additional load associated with producing an electronic article of that form. Alternatively, and more likely, an increasing amount of the article formatting will be farmed out overseas due to a more advantageous price point.

What is ChemSpider Synthesis and what do you mean by “all things synthetic”?

One of the outcomes of our delivery of the ChemSpider Journal of Chemistry was a series of submissions of synthesis procedures. By the third issue it was clear that there was a bias to using the platform to expose organic syntheses – short articles explaining particular reaction transformations…generally single step syntheses defining starting materials, products, experimental conditions and associated analytical data. Ultimately, it was turning into a “reaction database” of synthesis procedures. Our intention with ChemSpider Synthesis, an interim name for what we will deliver, is to host short articles regarding synthesis. They will be peer-reviewed using a blog-like feedback system where comments will be posted to the article. The community will be asked to contribute to the content and, in parallel, we will harvest synthesis procedures from the RSC archive of many tens of thousands of articles.

Could you please discuss the concept of public peer review and how might that change the current quite rigid tenure system? How do you encourage young scientists to get involved in Open Science in general and ChemSpider in particular? Jean-Claude Bradley seems to excel at outreach to chemistry faculty and young scientists and you to scientific societies, database managers in government and industry and to publishers.

Public peer review is, in many ways, already happening. The blogosphere has fast become an environment where science and publications are openly discussed and critiqued in the public eye. Take for example the recent blog discussions regarding sodium hydride as an oxidant that were conducted on the TotallySynthetic Blog. This blog post openly critiqued science reported in JACS and in the process removed the journal and the authors from the exchange. Since such “peer review” is already occurring online, and the example given is not the only one, the near future will see peer review and commentaries being posted directly to a publication online similar to what we are seeing already in the blogosphere. Authors will be held accountable for their work not only by original peer reviewers but also by the community at large. This will take some gentle navigation in the early days to deal with potential flame wars and open attacks but ultimately is likely to become mainstream and expected from the publishers. Personally I believe it will be quite some time before such openness will have an impact on the tenure system, as such a system is entrenched, in my opinion, in local organization politics and relationships over productivity and scientific impact. At the end of the day interpersonal relationships and conformance to an organization’s expectations and needs will define tenure conversion.

Speaking of Jean-Claude Bradley, could you please give us your views as to who excels at what vis-à-vis raising the public profile of Open Science? Do you agree with my characterizations here:

Jean-Claude Bradley: The main conduit and public face of Open Science to working chemists in academia. Ambassador and guide to Open Science to non-scientists. Explicator extraordinaire on the actual practices and value of Open Science and Open Notebook Science in particular.

JC is like the “billboard of ONS” and I mean that in a positive way. He speaks from the heart regarding his passion for ONS, he demonstrates true value and benefits to the approach, has brought together a team of disparate collaborators to produce demonstrations regarding what openness, services and a willing group of people can produce as an outcome. JC focuses his efforts on spreading the word, takes no offense when people don’t support the direction and knows that, ultimately, ONS will have a growing prominence in the future.

Cameron Neylon: Tools meister and theory man. Connector to industry (e.g., Google vis-à-vis Google Wave, and Elsevier.) Tireless and charismatic advocate and astute analyst of key developments in Open Science.

Cameron is a connector too. He is passionate about Open Science and a masterful communicator. Some of his blog posts for me are very much “I wish I’d said that…” in nature. Cameron has a way of bringing clarity to the challenges we face, offering some solutions but in a way that he is not wedded to his approach but to a solution. He is a bridge-builder and, in my opinion, single-handedly got science and Google Wave connected efficiently with the Google staff. There are many people discussing Google Wave and a number of people working with the technology but Cameron has marshaled us into action. He is a trusted evangelist for how such solutions can be applied to science and has the ear of many people, mine included. I just wish we could do more faster to support his vision!

Michael Nielsen: Thinker and theoretician on Open Science of a philosophical bent.

As with most people I have met who are involved with Open Science Michael is passionate, opinionated, thoughtful and fearless. I’ve met Michael only on two occasions, one of these where we shared the podium at the Library of Congress. His presentation was reflective in a way that it calmly called us to attention and action around Open Science. He has the ear of the publishers and some of his blog posts have the community listening, wondering, concerned and optimistic all at the same time. It depends on WHO you are in the community! As I am now in publishing this post was particularly interesting.

Andrew Lang: Supporter of all of the above and master of the nuts and bolts of the tools and technologies of Open Science.

It’s been many years since I saw the A-Team (but I hear they are making a new movie!) but if you ever saw it you’ll remember that you could lock the team in an old garden shed and with baling twine, an old lawnmower and a deck chair they’d be able to make a top notch speed racer. Andy is that character in the world of plugging together online resources to the benefit of those doing Open Science. He’s worked with JC Bradley and used Google Docs, Web Services and Open Data to publish a book on Open Solubility Data on Lulu. The paper co-authored with JC regarding Chemistry in Second Life does a great job in detailing how he is a plumber using the necessary tools to get a result. We’ve worked together with ChemSpider and the Open Spectral Data on the database to create the Spectral Game, both 1D and 2D. Andy is fast, efficient and not shy…I appreciate him asking us for things he needs…it improves our services and he gets to give away to the community for the benefit of all.

Bora Zivkovic: Organizer of the key conference ScienceOnline (where you will give a presentation this year on ChemSpider) and influential blogger.

Bora is wonderfully influential, easy to talk to, very connected and prolific. He is very valuable to me, keeping me connected to what’s going on in the world of science through his tireless communication via all of the delivery systems he uses. He is a master of the social network and a true evangelist for online science. You’ve seen how much work he has done on his blog so far to draw attention to ScienceOnline. While he is not the only one working on it he has a very loud voice, in a good way, in drawing attention to the conference. It’s been fully booked for months…and in this economy. It lends credence to the quality of the meeting as well as to how ScienceOnline is becoming more high profile. Good!

Antony Williams (that’s you): Builder, innovator and creator of key tool of Open Science, ChemSpider. Master of building basic platforms of Open Science and standard science and adroit and effective leader of outreach to mainstream scientific societies and organizations.

Is that about right or am I way off here?

I’d define myself as an Idea Guy with a try-it-and-see mentality. I’ve worked in government labs, in academia, in Fortune 500 corporate America, and in a small start-up that transitioned to an established presence. I now work for a British publisher and have owned two of my own companies. There is a place for ideas in all places of course but the smaller the organization the more likely it is that you can run in many directions trying out various things. In our domain that only holds true if it requires sweat and intellect and not capital investments. So much was possible with ChemSpider because the platform was cheap and the investment was intellectual sweat. I still like to try out many ideas in parallel if they have low time investments. Some of the greatest pay-off projects I’ve ever been involved with come from such investigations. There are others who I judge work in a similar way and get a lot done with minimal resources…Andy Lang, Rich Apodaca and Egon Willighagen among them.

ChemSpider wasn’t built just by me. There was a key group of individuals who worked very hard on the platform. I love being part of a highly effective team who knuckle down and progress projects. My role is definitely one of hands-across-the-seas trying to navigate the complexities of multiple opinions, stances and needs to deliver benefits to the community at large and the organizations involved. It is hard enough to establish win-win in some cases and gets very complex when there are many parties involved. But the challenge is what makes it stimulating and I rarely back down easily and stand for what’s right.

Despite my recent roles that have been more business oriented I still think of myself as a scientist and try to keep my hand in NMR though it’s limited today to software algorithms for NMR prediction and computer assisted structure elucidation. There’s a list of my papers on Mendeley and a couple of new ones in preparation now. I am fortunate that I have remained engaged with ACD/Labs at a technical level and get to work with some of the best intellect in the world in terms of NMR prediction and structure elucidation algorithms. I believe that to improve it is best to surround myself with people I can learn from and that invigorates my grey matter!

Could you please discuss the concept of ChemSpider Everywhere and how ChemSpider is linked to from blogs and provide examples of open source applets and ChemMobi—in the latter case could you provide us a scenario of how a scientist might use ChemMobi while, say, in the audience at a talk at a scientific conference or in a chat with a graduate student during a chance encounter in the hallway of a chemistry building?

The concept behind ChemSpider Everywhere is that our web services will allow those systems and scientists needing access to ChemSpider resources to get to them. I’m living a kind of “Intel Inside” mentality where structural information based on curated dictionaries connected to associated data can be fed into systems as necessary. Where is this happening already? Our web services are already integrated into mass spectrometry software from Bruker, Agilent, Waters and Thermo. There is interest in tapping ChemSpider to retrieve chemical structures associated with mass spectral data so they will fire off a file from their processing software and hit a subset of the ChemSpider database, generally a set of databases containing drug-related and metabolism information (KEGG, DrugBank, Human Metabolome Database etc). When they hit the database with a query for monoisotopic masses and narrow the search on these databases they will generally retrieve an appropriate set of hits and these are not just structures to embed into their software but links to ChemSpider with all of its rich resources.
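
To make the idea of a monoisotopic mass query concrete, here is a minimal sketch of that kind of lookup. It is not the ChemSpider web service or any vendor's integration: the three-compound `SUBSET`, the function names and the 10 ppm default tolerance are all assumptions chosen for illustration.

```python
import re

# Monoisotopic masses (in daltons) of the most abundant isotope of each element.
MONOISOTOPIC = {
    "C": 12.0,
    "H": 1.00782503,
    "N": 14.0030740,
    "O": 15.9949146,
    "S": 31.9720707,
}


def monoisotopic_mass(formula):
    """Sum isotope masses for a simple formula such as 'C9H8O4'.

    Brackets, charges and isotope labels are not handled in this sketch.
    """
    total = 0.0
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += MONOISOTOPIC[symbol] * (int(count) if count else 1)
    return total


# Hypothetical stand-in for a drug/metabolite subset of a larger database.
SUBSET = {
    "aspirin": "C9H8O4",
    "theobromine": "C7H8N4O2",
    "caffeine": "C8H10N4O2",
}


def search_by_mass(observed, ppm=10.0):
    """Return names whose monoisotopic mass lies within `ppm` of `observed`."""
    hits = []
    for name, formula in SUBSET.items():
        mass = monoisotopic_mass(formula)
        if abs(mass - observed) / mass * 1e6 <= ppm:
            hits.append(name)
    return sorted(hits)


print(search_by_mass(180.0423))  # prints ['aspirin']
```

Aspirin and theobromine share a nominal mass of 180 but differ by roughly 0.02 Da, so a tight tolerance on a restricted subset is what keeps the hit list short.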

We developed an “Embed” functionality similar to that for YouTube where people could find a structure of interest on ChemSpider and rather than having to save the image and then embed into a blog post/wiki article etc. they could grab a piece of JavaScript and insert it. This would then retrieve the structure from ChemSpider, would link the user back to the chemical record and give single click entry into all of the resources associated with this compound.

We did the same with spectral data for people to use and this feeds the Spectral Game from JC Bradley and Andy Lang.

Our ChemSpider services allow ChemSpider to be on the iPhone via ChemMobi and will show up in other mobile applications shortly.

We’ve delivered browser widgets and allow embedding of our searches into websites.

Recently JC Bradley and Andy Lang worked to produce an online book, hosted on Lulu, that uses our services to provide properties and structures integrated with data hosted on Google Docs. The result is wonderful.

In order to help Wikipedians create structure boxes easily (ChemBoxes and DrugBoxes) we implemented a WikiBox generator. ChemSpider data are already in a lot of places.

Our data are in PubChem and will soon be in the Symyx DiscoveryGate database.

More and more databases are linking back to us. Even the Chemical Abstracts Service indexed us and have over 300,000 structures from ChemSpider in their database. We are working on providing links from ChEBI at present.

Our intention is that ChemSpider will be everywhere…in an appropriate manner supporting the community. We welcome more ideas of what that would look like…but are not short of our own!

What is an InChI and what is an InChI Resolver and why are they important?

An InChI is an international chemical identifier and is a way to represent a chemical structure in alphanumeric text. A very basic definition is on Wikipedia and that gives some basic figures showing how a chemical structure breaks down into a series of alphanumeric layers to describe the complexity of a chemical structure. ChemSpider was built on InChIs as a way to deduplicate structures and allow fast structure searching. InChI is becoming part of the connection network for chemistry online and I provided an overview of both internet-based chemistry and how InChIs are important to creating a structure searchable internet in a recent presentation to Drexel University students (available as a video on SciVee).

One of the issues with InChI “Strings” is that they vary in length and, for large molecules, they can truncate in a search engine thereby reducing the impact in terms of searching. After a presentation at Google the InChI team were encouraged to consider hashing the strings to provide a fixed format InChIKey, mostly so that search engines can handle them. The problem with an InChI hash is that there is no way to reverse it…the only option is a lookup.

Almost two years ago I wrote a blog post entitled “We need an InChI Resolver and we need it now.” That blog post offers an extensive description of WHY we need a resolver and some of the challenges associated. There are a lot of comments on the post and they are worth reading.

Now, there are complexities with InChIs in terms of the number of option settings associated with them and so there can be many InChI Strings and InChIKeys for a single compound. Therefore, Standard InChI settings were introduced to help. Hashes also have their own issues and a single InChIKey can come from many structures. The first observation of this was made at the ACS in Washington last fall.
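
The points above (fixed-length keys, no way to reverse a hash, and more than one structure per key) can be sketched in a few lines. This is an illustration of the general idea only: `fixed_length_key` is a plain truncated SHA-256 digest and is not the official InChIKey algorithm, and the `Resolver` class is a hypothetical in-memory lookup table, not the RSC proof of concept.

```python
import hashlib
from collections import defaultdict


def fixed_length_key(inchi, length=14):
    """Hash a variable-length InChI string to a fixed-length key.

    Illustrative only: the real InChIKey uses its own encoding and layout.
    """
    digest = hashlib.sha256(inchi.encode("utf-8")).hexdigest()
    return digest[:length].upper()


class Resolver:
    """Lookup table from key back to InChI string(s).

    A hash cannot be reversed, so the only way back is to remember
    which strings produced which key.
    """

    def __init__(self):
        # One key may map to several strings if a collision ever occurs.
        self._table = defaultdict(set)

    def register(self, inchi):
        key = fixed_length_key(inchi)
        self._table[key].add(inchi)
        return key

    def resolve(self, key):
        # Unknown keys return an empty list rather than raising.
        return sorted(self._table.get(key, ()))


resolver = Resolver()
water_key = resolver.register("InChI=1S/H2O/h1H2")
ethanol_key = resolver.register("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3")

print(len(water_key), len(ethanol_key))  # both keys have the same length
print(resolver.resolve(ethanol_key))
```

Storing a set per key, rather than a single string, is the design choice that leaves room for the many-structures-per-key case, and a federated resolver would amount to asking several such tables in turn.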

From my initial blog post we took the feedback and with the kind support of the RSC we went off and built an example of a resolver. It’s a proof of concept and what is necessary is really a federated approach where multiple resolvers can be integrated. We are presently in discussions about this with other parties and a federated resolver will be available, in proof of concept form, later this year.

Could you please discuss your work with Wikipedia and your reactions to its recent move to be less open and more reliant on expert editing? Is that a good thing or bad thing vis-à-vis ChemSpider and for society at large and for Wikipedia itself?

The best overview about starting the work was written almost two years ago now. What was initiated after a conversation with Martin Walker from the Wikipedia Chemistry team was a project whereby we would examine every chemical compound on Wikipedia, atom by atom, bond by bond, and validate the correctness of the displayed structure relative to its association with the chemical name, the CAS number, the PubChem link and other related data. What started out as a project that might take a couple of months is only now nearing completion. It has involved discussions literally down to a single stereo bond as it did with Tacrolimus.

In that situation I ended up proving that one stereocenter was inverted relative to all the authorities. I unfortunately created a bit of a disturbance when I suggested that the ONLY way to validate the CAS numbers on Wikipedia records was to search SciFinder and validate the structure-CAS number relationship. CAS initially objected publicly but eventually a beneficial collaboration was established resulting in validation of an overlapping set of compounds from CAS and Wikipedia. CAS also then released their CommonChemistry site to the public.

The validation effort of Wikipedia has been a focus project for a small dedicated team of about half a dozen people. It’s been long and laborious because we all have so many other projects underway but I believe it has been successful and the quality of structural representations and associated data has improved dramatically.

I confess that I am only slightly aware of the situation of Wikipedia becoming less open and more controlled in its editing. On ChemSpider we have always had users, curators and master curators, proactively acknowledging skills upfront and giving more responsibilities to certain parties. This has worked pretty well for us. Wikipedia should be grateful to all contributors. Certainly the contributions of experts to validating and enhancing articles will be very important. Personally I think articles on Wikipedia are, in general, very useful when they are more than stubs.

Could you please discuss ChemSpider’s relations with PubMed and other services of the National Library of Medicine?

Our relationships to date are simply those of using PubMed services/API to access information and data to link up to ChemSpider. PubMed is truly an amazing resource and our connection last year delivered at the ACS really raised some eyebrows regarding what’s possible as a result of integrating public resources with the appropriate programming interfaces.

How is it going with the RSC? Do you have any advice for those who have created Open Science tools that start to garner interest and acquisition offers from mainstream publishers and scientific societies?

It took a while to migrate the ChemSpider platform into the RSC environment. We were moving the system from three fairly nominal servers running in a basement to an infrastructure based on virtual servers. We also moved from a “continuous beta” where we were updating code to the live environment on a regular basis to an environment where we had development servers, test servers and live servers. This migration took a while but the benefits are that we have full IT support now and are not carrying all the responsibilities for maintaining the hardware, the backups and the core software platform (SQL Server, IIS etc). We also have less risk in terms of power outages and the pipes serving up ChemSpider are much thicker. The benefits to the user are obvious…better uptime and faster overall response from the site.

In terms of people who have created Open Science tools and want to garner interest in terms of acquisition my suggestion may be quite contrary. Don’t pursue the sell off as the endgame but build the appropriate solution for your users, develop a following and let your success speak for itself. I believe that’s what we did well…we created a solution of value to the community, stayed focused and on task and let the community speak on our behalf regarding the value of what we were building. Hopefully the publishers and societies are watching and will engage.

What are you most proud of vis-à-vis ChemSpider? What have been the biggest challenges since you started working on it and what has been particularly gratifying about the relationship with the RSC? What are your plans, if any, for expansion in the US?

I am most proud of the fact that we stayed true to our vision of “Building a structure centric community for chemists.” We have been upfront and honest in all of our discussions with our users and the community via the blog and on many other public forums. I believe that we have earned the respect of our users and am humbled by the number and quality of scientists who have been willing to support and work with us.

One of the challenges was that of balancing the hurdles of paying the bills, both personal and business, while growing a free access website with no immediate revenue stream. Certainly the biggest challenge with ChemSpider was the initial attacks on our efforts made by certain members of the blogosphere. There was a period of many months where our every move was questioned, examined and discussed publicly. These attacks were the ones that actually brought me into the blogosphere and gave me a voice to discuss our intentions, our challenges and certainly our imperfections. Over the next few months I had to clean up a lot of misinformation and clarify our reality over others’ stories. It was an interesting time but in many ways we owe a debt of gratitude to those who challenged us as it helped us continue to declare our mission publicly and stay focused on delivering.

We have no immediate plans to expand in the US at present and need to balance expansion of the ChemSpider resources with overall project needs. Our team is not involved only with the delivery of ChemSpider but with a number of other projects to support cheminformatics within the organization.

What scientific conferences do you recommend that young scientists interested in Open Science attend?

You and I will both be at ScienceOnline2010 in January 2010. That would be a good one for exposing people to what’s going on with Science Online, much of it Open. Many of the major conferences in any of the sciences now have sessions regarding data sharing and Open Science…it’s just a matter of looking for the session. Certainly my experience with the Google SciFoo camp, one I was fortunate to be invited to on two occasions, was one of exposure to a lot of Open Science…unfortunately it’s invitation only.

What blogs should they read? Yours for one, right?

Mine is at http://www.chemspider.com/blog/. In recent months I have stopped visiting my RSS reader as much and look for what dribbles onto Twitter from the blogs and navigate over. Ones that I visit regularly are 1) Science in the Open from Cameron Neylon, 2) Useful Chemistry from JC Bradley, 3) Michael Nielsen’s blog, 4) John Overington’s ChEMBL-og, 5) the Mendeley Blog and 6) ALL of David Bradley’s blogs. I read a lot of others also, but this is a short list.

Whom should they follow on Twitter?

I’d suggest following David Bradley, Timo Hannay, Duncan Hull, Egon Willighagen, Cameron Neylon, Rafael Sidi and Chris Anderson (Wired Magazine). What these gents have to say is specific to my mixed domain of cheminformatics and publishing so may not be of interest to chemists directly.

Should they immediately jump into the lively Life Scientists room on FriendFeed?

As with all forms of social networking tools online I would say that people need to have an interest in such an environment to begin with. Many people don’t read blogs yet, many are unaware of what RSS feeds are, Twitter would be an annoyance and FriendFeed just one more on the list. It is a very small fraction of the community that is using these tools but it is growing of course. I find FriendFeed of value to ask questions and engage a group of domain experts in discussion but the reality is that, for me, this is a tight knit community and I could engage the majority of them directly by email. I believe that Cameron Neylon and JC Bradley have had success in using FriendFeed to initiate projects and activities around funding applications. The truth is that there are so many groups to interact with and so many activities already underway for ChemSpider that I have backed away a little from all of these tools of late just because of time limitations. I’m focusing instead on building closer working relationships with a select group of people with whom I can get things done and produce an outcome or measurable output. I found that I was losing a lot of time in a week on conversations that didn’t lead anywhere. An interesting discussion was recently started here.

What are your plans for ChemSpider for the next year? For the next five?

ChemSpider was conceived in December 2006 and released to the public at the Spring ACS in Chicago in 2007. In terms of a functional system it’s less than 3 years old and having a 5 year vision would be as simple as having chemists think of ChemSpider as their first port of call to source information about “chemicals” as well as host information about chemistry. Information about chemicals will include properties, data, associated literature, suppliers, associated scientists, Open Notebook Science pages, links to information in other databases and data sources and so on. Hosting information about chemistry will include the chemical substance itself but also associated data, synthetic procedures, analytical data etc. generated within a scientist’s laboratory. If we can make this happen at a minimum then we have delivered a foundation dataset and appropriate set of web services to feed the developing semantic web for chemistry. In so doing we will have, hopefully, delivered one of the primary search engines supporting internet-based chemistry.

Finally, who are your heroes in chemistry, science generally, academia, technology and in any other aspect of life?

My heroes in chemistry are people who I believe have directly mentored me in a way that has led to an improvement in my understanding of some aspect of my work. In this list I include Gary Martin who has taught me much about NMR spectroscopy (and we have published a lot together!), Keith Preston, my PostDoc supervisor, now retired, who taught me much about how to question experimental observations and Mike Detty and Steve Godleski who I worked with at Kodak running many midnight experiments just out of interest and passion. I’m sure that none of these people will be recognizable names for your audience.

In science generally I look to the work and influence of Hawking, Feynman and Pauling but I am impressed by the teams of people who work on international projects such as the LHC.

Technology is a difficult area to be specific about as it’s not easy to pinpoint the key individuals but the breakthroughs come down to teams within specific organizations and they are hard to separate. I follow Google’s and Microsoft’s developing technologies. Apple are always innovating and tweaking and I appreciate their focus on delivery. In my specific area of science, NMR, I believe that Bruker has demonstrated the most consistent innovations in recent years and this has also extended into other analytical technologies including mass spectrometry. They are of course challenged by Waters and Thermo.

In other areas of my life I am impressed with investigative journalists such as Christopher Bryson and Debbie Bookchin and Jim Schumacher who distil complex information into a form digestible by the public. Wired magazine is my favorite read.

I am not a sports person in terms of watching and following teams…I’m more of a doer. I’m a runner, a regular visitor to the gymnasium to lift weights and have started running 5km races with my wife and twin boys at the weekend. I’ve just set myself a personal target of running 1000 miles in the next year and clocking it with technologies such as Nike+ and engaging the community in my “suffering” through a blog.

I’m also hoping to use some of my energy to raise some money for asthma as one of our sons is asthmatic. If anyone out there is connected into the asthma not-for-profits and can connect me up that would be great. Let’s see how it goes!

Thank you for your time, Dr. Williams.

4 Responses

  1. Hope,

    There’s a lot here, but I’ll just respond to this quote:

    “I’m focusing instead on building closer working relationships with a select group of people with whom I can get things done and produce an outcome or measurable output. I found that I was losing a lot of time in a week on conversations that didn’t lead anywhere.”

    This is the smart approach. There’s a lot of noise out there and social media can be extremely anti productive. The trick is to maintain time limits, communicate only with people that add value to the experience, and to focus on using the channels only when it helps you. It’s a wide-spread misconception that you have to be connected and producing content all the time to be successful. The reality is that simply having a presence and contributing in meaningful ways when you can gets you most of the way there.

    That said, this is a great talk. Open standards in general is (and should be) the trajectory we choose as a society. Open science, open standards, open government. It gets the data into the hands of the people who need it most and who are in positions to use it to make the greatest impact on society.

  2. Hi, Steffan. Thank you so much for your very interesting comments. I am buoyed by your statement, “It’s a wide-spread misconception that you have to be connected and producing content all the time to be successful…” given that I don’t post as often on this blog as I had hoped!

    Apropos of your comment, “Open science, open standards, open government. It gets the data into the hands of the people who need it most and who are in positions to use it to make the greatest impact on society…” I have been quite surprised and dismayed by the lack of comment on this initiative and at this forum:

    http://blog.ostp.gov/2010/01/07/phase-iii-wrap-up/

    It started off well in Phase 1 with over 200 comments but as of today in the final period for open comment I am the sole respondent–pretty sad on such an important matter in a country of over 300 million! I have tried to spur others in various realms to add input with scant success. I am really baffled–important decisions are to be made and it was wonderful of the Office of Science and Technology Policy to sponsor the forum.

    Thank you for starting the discussion Antony Williams pointed me to.

    Hope

  3. Steffan’s summary statement “There’s a lot of noise out there and social media can be extremely anti productive.” hits the nail on the head. It is not difficult to burn many hours a day reviewing the activities on the social network. A number of people have commented that I have become less visible in the network since joining the RSC with ChemSpider. I agree. The reason is quite simple…I have a full time job now and responsibilities to projects and timelines that leave me a lot less time to participate as much as I did previously. I still participate as fully as I can but a lot less than I’d like.

  4. [...] be a very interesting area to follow. The subject of Open Science is the topic of discussion in this post which features an interview with Anthony Williams from ChemSpider, which is an open science project [...]
