Word for scientific publishing
- Posted: Apr 17, 2008 at 8:30 AM
Pablo Fernicola is a group manager at Microsoft. He runs a project focused on delivering tools and services for scientific and technical publishing, with a particular interest in the transition from print to electronic and web-based content, and its implications for collaboration, search, and content discovery in the future.
In this interview, Pablo explains how a new add-in for Word, now available as a technical preview, helps authors and publishers of scientific articles work more effectively with one another, and with online archives like PubMed Central.
Pablo Fernicola's blog: ex Scientia
JU: Hi Pablo, thanks for joining us to talk about a new Word add-in for authors of scientific journal articles. It's an interesting story about applying the XML capabilities of Office, and also about the evolution of journal publishing. How did this project get started?
PF: It's an incubation project. Three people had an idea: Jean Paoli, an XML pioneer, Jim Gray...
JU: Oh really? I didn't know he had been involved.
PF: Yes, he and Jean really pushed to get this started, and they both recruited me for this project. It's been a little over a year since Jim disappeared, and that was a big blow, considering his key role.
And the third key person is Tony Hey.
JU: We should explain that Tony runs what's called the technical computing initiative, and is very involved in figuring out how Microsoft can help various people in the scientific community address computing and information management challenges.
PF: Right. Scientific authors in many disciplines use Word to write articles. We looked into how to simplify the workflow, streamline the process, and lower the cost. And not just for the authors, but also for the journal publishers.
JU: It's been true for a long time in publishing, and not just scientific publishing, that there have been real challenges getting that Word content converted into the kinds of long-term formats we need: XML that's richly decorated with metadata.
Publishers have tended to give people templates that try to use styles to control what's in the document. But since Word 2003, and especially since Word 2007, there has been a set of XML capabilities that make a much more robust approach possible.
PF: That's right. Before Word 2003, styles were the best you could do. And people got quite far by relying on them. But they were very fragile. When you copied and pasted, styles would bleed across. It was hard to disentangle that when you converted the file.
JU: That's part of the problem. And part of it is that, along with the content itself, there's a process involving the metadata, and that process is divided between the author and the journal publisher. It's a shared responsibility, and you need an information management system that embraces that division of labor.
PF: Also: What kind of user interface do you present to these different groups? There are really three groups. First the authors, who are subject-matter experts but don't know anything about the publishing process, and shouldn't have to know.
Second, the journal editors. They're also subject-matter experts, but they also know about the structure of the journal, and about the metadata they need to apply.
And third, you have companies and vendors who do backend tools and services, as well as the folks who work on the electronic archives. With the move from print to electronic journals, the role of the archive becomes very significant. Either the journals have their own repositories, or you have centralized repositories at university libraries or larger institutions, for example the National Library of Medicine with PubMed Central, or Cornell with Arxiv.org.
That group is very technical in terms of understanding file formats, elements, and properties.
JU: But even those folks shouldn't necessarily need to master all of that. They'd rather spend their time on math and physics, not the minutiae of XML publishing.
PF: That's right. The way the pipeline is set up today, you start with a Word document, and then at a certain point you convert to XML, and from that point on, all the editing happens in an XML editor.
JU: So in biology and medicine, the format defined by the National Library of Medicine, and the one you're supporting in this Word add-in, is called the NLM DTD.
PF: Yes. It's not only used by PubMed Central, but also a lot of the commercial publishers are using it for their archival format. And we're also seeing it used by publishers in other disciplines, for example law and social science.
JU: Really? It's general enough for that?
PF: It is fairly general, and I'm really impressed by how the community related to scientific, technical, and medical publishing is not reinventing the wheel, but instead leveraging something that's in common use.
A significant point is that the format usually does not encode any presentation elements. It's all about the semantics and the metadata, not about what font or background color. As you try to preserve data for the long term, for centuries from now, the presentation is not relevant, it's the content that matters. You can always generate a presentation from it.
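To make the point about semantics versus presentation concrete, here is a minimal Python sketch. It assumes a tiny, made-up article fragment that borrows a few element names from the NLM journal article DTD (`article-title`, `contrib`, `surname`, `given-names`); the function names are hypothetical and this is not code from the add-in. The same presentation-free XML is rendered two different ways.

```python
import xml.etree.ElementTree as ET

# Made-up fragment for illustration; only the element names follow the NLM DTD.
NLM_FRAGMENT = """
<article>
  <front>
    <article-meta>
      <title-group>
        <article-title>An Example Article</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <name><surname>Doe</surname><given-names>Jane</given-names></name>
        </contrib>
        <contrib contrib-type="author">
          <name><surname>Roe</surname><given-names>Richard</given-names></name>
        </contrib>
      </contrib-group>
    </article-meta>
  </front>
</article>
"""


def extract(xml_text):
    """Pull the title and the (given names, surname) pairs out of the XML."""
    root = ET.fromstring(xml_text)
    title = root.findtext(".//article-title")
    authors = []
    for contrib in root.iter("contrib"):
        if contrib.get("contrib-type") == "author":
            name = contrib.find("name")
            authors.append((name.findtext("given-names"), name.findtext("surname")))
    return title, authors


def render_plain(title, authors):
    """One possible presentation: a plain-text title and byline."""
    byline = ", ".join(f"{given} {surname}" for given, surname in authors)
    return f"{title.upper()}\nby {byline}"


def render_html(title, authors):
    """Another presentation of the same data: a small HTML snippet."""
    items = "".join(f"<li>{surname}, {given}</li>" for given, surname in authors)
    return f"<h1>{title}</h1><ul>{items}</ul>"


if __name__ == "__main__":
    title, authors = extract(NLM_FRAGMENT)
    print(render_plain(title, authors))
    print(render_html(title, authors))
```

Nothing in the fragment says how the title or the names should look; each renderer decides that on its own, which is why the archived form can outlive any particular layout.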
JU: So as we see in the accompanying screencast, you've created an add-in that presents editing enhancements both for authors and for editors. The interface for the author helps that person fill in the template and also apply those metadata elements which are appropriate for the author to apply. Then there's a separate interface for the editor. Explain a bit about how this can change the workflow.
PF: If you start from the author side, a key premise was requiring less effort to produce a valid document. You want to avoid having the author round-trip with the editor, back and forth, because they didn't fill in all the required information.
JU: And that happens a lot?
PF: Yes. And it's not just failure to provide the required information. We want to make it easier to provide the correct information. Consider co-authors. You'll likely work with the same ones over and over. You want to avoid having to repetitively enter that information, and avoid having errors creep in. Remember: As we move to electronic publishing, search becomes key. It's the main way people will find articles. To have good search results, you need to know the information in the articles is good. If the last name of the author is misspelled, it's harder to find all the papers from that author.
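The two ideas Pablo describes here, catching missing required information before submission and reusing co-author details instead of retyping them, can be sketched in a few lines of Python. Everything in this sketch is an assumption made for illustration: the list of required fields, the `Author` record, and the `CoauthorBook` class are hypothetical and do not describe how the add-in works.

```python
from dataclasses import dataclass

# Hypothetical list of fields a journal might require from an author.
REQUIRED_FIELDS = ("title", "abstract", "authors")


@dataclass(frozen=True)
class Author:
    given_names: str
    surname: str
    affiliation: str = ""


def missing_fields(submission):
    """Return the required fields that are absent, empty, or only whitespace."""
    missing = []
    for field_name in REQUIRED_FIELDS:
        value = submission.get(field_name)
        if not value or (isinstance(value, str) and not value.strip()):
            missing.append(field_name)
    return missing


class CoauthorBook:
    """Remembers co-authors so their details are entered once and reused."""

    def __init__(self):
        self._by_key = {}

    @staticmethod
    def _key(given_names, surname):
        # Case-insensitive key, so "doe" and "Doe" find the same record.
        return (given_names.strip().lower(), surname.strip().lower())

    def add(self, author):
        self._by_key[self._key(author.given_names, author.surname)] = author

    def lookup(self, given_names, surname):
        return self._by_key.get(self._key(given_names, surname))


if __name__ == "__main__":
    book = CoauthorBook()
    book.add(Author("Jane", "Doe", "Example University"))
    draft = {"title": "An Example Article", "abstract": "   ", "authors": []}
    print(missing_fields(draft))
    print(book.lookup("jane", "DOE"))
```

A check like this runs on the author's side, which is the point of the workflow change: the gaps are found before the manuscript ever reaches the editor.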
JU: In terms of the consistency of author information, you can help with this Word add-in by normalizing the metadata editing process, but there still has to be a reliable disambiguated set of author names which are managed by the publishers, and ideally by a federation of publishers, and ultimately even more broadly than that.
PF: Correct. If we look down the road, we see something like a global directory, but we're not there yet. We have to build up to that. When you look at the add-in, we're taking small steps that will get us to at least a better baseline than we have today.
JU: Or, given that the world is moving to that baseline anyway, will help make it quicker and easier to get there.
PF: That's right. If we think of the authors, the key thing is to provide a very simple interface. As we consider features, if they look complicated we'll drop them. One of the prevailing rules is: Don't duplicate Word UI. If there's a way to do tables or equations or reference lists in the Word UI, we'll use those. We don't want to provide a lot of new UI for the authors to learn.
JU: What I find interesting, here, from a workflow perspective, is how people in different roles are touching different pieces of data and metadata. Historically that's been a one-way process. Once the article is converted into the NLM format, it's typically not available to go back to the author for editing in the original context. So the person at the journal has to be responsible for round-tripping change requests.
Similarly with the editing of the metadata. The author might want to make some changes, the journal publisher might want to make some changes, and those things tend to happen in disparate environments. What this is showing is what has always been the promise of robust XML editing on the desktop. You can bring all these chores into a common environment. The unit of workflow, the document, is something that can flow to different people in different contexts, and be modified in different ways, but it hangs together as it moves through the process.
That's a big deal, and it goes far beyond the specific domain of scientific and technical publishing.
PF: Right. And in addition to keeping all the data together and providing a simple interface, publishers have told us that as they move to electronic-first, they expect the cycle times to shrink. With the current disconnected tools and formats, that's hard to achieve. If you want to make a quick revision and send it to the journal, it may be too late because they've started the process of conversion, and once that starts there's no stopping it.
And to your point about other domains, people have told us they want to use this for things like grant requests as well, moving away from article content to other kinds of content that can benefit from the structure and validation.
JU: The problem is almost universal.
PF: Yes, anytime you want to validate content, or preserve it for a long time, these capabilities are relevant.
JU: So 2003 was the first major deployment of XML capability for Office and for Word. We haven't yet seen as much use of that capability as I'd expected. Why?
PF: The biggest challenge was that XML wasn't the default format. You had to have authors do special things to produce XML. Also, if you think of the NLM formats, they contain things that aren't part of normal Word content or UI. In Word 2003, extending the document content, or extending the UI, wasn't as easy as it has become in Word 2007.
With Word 2007, you end up with a set of things, in a single installation, that bring all the enabling capabilities together at the same time and in the same place.
JU: So what you did have, in Word 2003, was user-defined schema, but you're saying that wasn't enough, and that the newer capability of including arbitrary chunks of XML is more flexible for this purpose?
PF: Yeah. There's two parts to that. There's content within the document, so the ability to have new XML elements that are part of the document, and that's more robust and expressive in Word 2007's Open XML format. Then there's the ability to have other XML data packaged within the file. Custom XML is what that's usually called.
JU: And that's the method you're using for the journal metadata?
PF: Right. And since this is all defined as part of the Open XML format, and since the packaging of the file follows the standard as well, developers can build their own tools to create metadata, access metadata, or even create the whole file.
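As a rough illustration of what building your own tools can look like: an Open XML file is a zip package, and Word conventionally stores custom XML parts under names like `customXml/item1.xml`. The Python sketch below builds a stripped-down package in memory and reads the metadata back. The metadata element names are invented for this example and are not the add-in's actual schema, and the demo package leaves out the content types and relationship parts that a real .docx needs, so Word would not open it.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Hypothetical metadata payload; these element names are invented for the sketch.
METADATA_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<journalMetadata>
  <journalTitle>Example Journal</journalTitle>
  <manuscriptId>EX-0001</manuscriptId>
</journalMetadata>
"""


def build_demo_package():
    """Build an in-memory zip that mimics where a .docx keeps custom XML.

    This is NOT a complete Open XML document: content types and
    relationships are omitted to keep the sketch short.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("word/document.xml", b"<placeholder/>")
        package.writestr("customXml/item1.xml", METADATA_XML)
    return buffer.getvalue()


def read_custom_xml_parts(package_bytes):
    """Return {part name: parsed root element} for each custom XML item.

    Parts are found by their conventional names; a robust tool would
    follow the package relationships instead of guessing paths.
    """
    parts = {}
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        for name in package.namelist():
            is_item = name.startswith("customXml/item") and name.endswith(".xml")
            if is_item and "Props" not in name:  # skip itemPropsN.xml
                parts[name] = ET.fromstring(package.read(name))
    return parts


if __name__ == "__main__":
    for name, root in read_custom_xml_parts(build_demo_package()).items():
        print(name, root.findtext("journalTitle"), root.findtext("manuscriptId"))
```

The sketch uses only the standard library, which is the practical consequence of the format being an open, documented package: no copy of Word is needed to get at the metadata.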
JU: So this is a first cut you're putting out for publishers to experiment with, and to help you refine the templates they'll deploy to authors?
PF: Yes, this is a technology preview for evaluation and feedback. The idea is that the publishers will create the templates themselves.
JU: Who are you working with?
PF: We're talking to many different journals, publishers, and archives. Each constituency has a different set of interests and requirements. Journal editors care a lot about the templates, but folks at PubMed Central and Arxiv care more about how the metadata gets validated.
We expect a beta shortly, and a 1.0 release by late summer. It'll be a free add-in for Word.
JU: Well thanks Pablo. I fear that this will only seem interesting to the relatively small number of folks who have a direct interest in scientific, technical, and medical publishing. But I hope it will be apparent that it's much broader. You hinted at that when you mentioned that the NLM format, despite having been invented for the particular purposes of certain disciplines, is being taken up by people in legal and other disciplines.
I'm excited about it because I care about publishing and metadata and robust information systems and open formats, and this brings all those things together. I'm glad to know that it's happening, and I'm glad you're working on it.
PF: It's really proving the value proposition of XML, and showing how it's coming of age in a mainstream production environment.
JU: Yep. For those of us who've been thinking about this for a long time, there's been a tendency to get frustrated and feel like it'll never happen. But it just takes a while for things like this to make their way into the mainstream, and this is a great example of that.
Well, thanks Pablo!
PF: OK, thanks!