How Microsoft's External Research Division works with a new breed of e-scientists

Tony Hey, VP for the External Research Division within Microsoft Research, leads the company's efforts to build external partnerships in key areas of scientific research, education, and computing. He's been a physicist, a computer scientist, and dean of engineering, and for five years ran the UK's e-Science program. These experiences have given him a broad view of the ways in which all the sciences are becoming both computational and data-intensive. Microsoft tools and services, he says, will support and sustain the new breed of scientists riding this new wave.


[Image: Tony Hey]

JU: For this series of interviews I've spoken to a number of Microsoft folks who are working with external academic partners on projects that fall under your purview. The list includes Pablo Fernicola's Word add-in for scientific publishing, Catharine van Ingen's collaboration with Dennis Baldocchi at Berkeley on the analysis of CO2 data, and Kyril Faenov's HPC++ project to bring cluster computing to the classroom. These are all pieces of your puzzle, right?

TH: Absolutely.

JU: By way of background, you've been a physicist, then a computer scientist, and then for a time led the UK's e-science program.

TH: Which would be called cyberinfrastructure in the US, yes. I'm on the NSF's advisory committee for cyberinfrastructure; it's a very similar goal.

JU: And then you surprised a lot of people by joining Microsoft. Take us through your initial role leading the TCI [technical computing initiative] and on to your current expanded role leading MSR's external research efforts.

TH: Right. So having been a physicist, and then a computer scientist working on parallel computing for years, and then chair of my computer science department and then dean of engineering, I think I understand the community we're trying to work with pretty well.

Also, as you mentioned, I spent five years running the UK e-science program. That was about huge amounts of distributed data, and about collaborative multi-disciplinary research in a variety of fields. Environmental science, bioinformatics -- almost every field of science now has some element of distributed and networked collaboration.

The science agenda was to provide tools and technologies that make that collaboration trivial -- just as, with Web 2.0, your grandmother can do a mashup.

I don't think the UK e-science program achieved that, but I do believe that Microsoft can help make tools and technologies available that will help scientists and researchers do their work.

JU: In your parallel computing phase, you helped write the MPI [message passing interface] specification, correct?

TH: Yes. I've been in this for 30 years, on and off. I have very good friends in the high-performance and parallel computing communities here in the US, and I was involved in European projects. There was a danger that the Europeans would go one way, and the US another, so it was time to see if we could get the community to put together a community standard.

It isn't an ISO standard; there wasn't a big standards body. It was a group of experts who got together -- the academics and the industry players. A rather small set, and we used to meet every six weeks at the Dallas airport, so you really had to be dedicated to go there.

JU: [laughs]

TH: But what came out of it was a standard which has stood the test of time. I co-authored and initiated the first draft. It's been much changed since then, and I don't take credit for the final thing, but I did try, with Jack Dongarra, to initiate the standards process, and I think I remember buying the beer at the first session.

JU: What's interesting to me is that despite that, you've been a vocal skeptic regarding raw grid capability. And you've been very careful to stress that in your view, the real challenges have to do with data -- the ability to combine large quantities of data from multiple sources, and enable people to make sense of it.

TH: Yes. I used to work in high-end supercomputing and parallel computing, but what distinguishes this decade is that we'll collect more scientific data than we have collected in the whole of human history. Instead of struggling with the problem of too little data, scientists will be struggling with the problem of huge amounts that they can't process or analyze. And it may be stored in different places, on different continents, so how do you put it together? How do you federate?

That's the real challenge. Very few people want to use petaflop computers. Most of the biologists, chemists, and engineers need only lesser capabilities that a simple cluster can provide. And then you put the cluster where the data is, because the data is what's difficult to move around.

JU: Yes, Kyril Faenov made this same point in my interview with him. There are only a handful of intergalactic cloud infrastructures of the sort that a Google or Amazon or Microsoft can support, they're one-of-a-kind beasts, and you can't always bring your data to them. So he's interested in enabling organizations to stand up their own more modest clusters at the sites where the data lives.

So, let's discuss the opportunity that you see. In another interview you said:

Rather than wasting the enthusiasm and talents of science graduate students by assigning them the task of building systems capable of handling, analyzing and mining literally petabytes of data, scientists should look to computer scientists and the IT companies to raise the level of abstraction and to provide them with the components of a reliable and functional cyberinfrastructure.


That's the most concise mission statement I've found for what you're doing.

TH: Exactly right. Part of my reason for joining Microsoft was having had a great friendship, and many discussions and arguments, with Jim Gray, from 2001 onwards.

We argued and disagreed on many things, but we also agreed on things, and what we agreed on in particular is that a different paradigm is emerging. So for example there's experimental physics, there's theoretical physics, and now the third paradigm, it's clear, is computational physics based on simulation.

What we're looking at here is data-centric science, where you'll do collections-based research -- like you do in mashups, but now with scientific datasets. And increasingly, you'll use semantics to get from data to information to real knowledge.

So I came to Microsoft partly because of Jim Gray, but partly because I think companies can help. I struggled mightily with just open source tools. I used to produce open source tools myself, as an academic. MPI has a wonderful open source implementation, and that was one of the key things that we did.

But I also know that open source, particularly when produced by academics like myself, well, it works on my machine, but if you want it to work on your machine, that's your problem.

So one of the things I set up in the UK was, in fact, a software engineering center called the Open Middleware Infrastructure Institute, where I put a lot of money in to get these open source codes tested and documented and made more reliable and sharable.

That's why I think that a judicious mix of open source with commercial -- it could be from IBM, from Oracle, from Microsoft -- is the way to provide a more reliable infrastructure.

That's part of the motivation for the tools we're producing around the technologies that scientists use to do their publication, their data mining, and so on. I think Microsoft can really take a lead here, and that's why I joined.

JU: Elsewhere you've said:

Essentially I match up Microsoft researchers with major scientific problems that computer science technology can help to solve.


What are those major problems?

TH: So, I came with a purely scientific mission with TCI. But now I've moved into Microsoft Research, and we have a bigger agenda. In terms of external research, we focus on four themes.

One is health and wellness. That's bioinformatics, medical solutions, and so on. Really exciting, we've got some good projects in that area.

JU: I've talked to Kris Tolle and have done an interview with George Hripcsak, who's one of the recipients of funding in the genome-wide association studies program.

TH: Kris is great. She and Simon Mercer are looking at the biomedical area, and they've got a wonderful set of projects, ranging from high-tech work involving RNA and HIV/AIDS down to the last mile of preventative health care -- looking at ways, in Latin America, to connect a smartphone to a low-cost diagnostic tool, like a blood-pressure monitor, and thereby deliver health care in remote places.

The next major area is what we call E3 -- earth, energy, and the environment. That includes the astronomy work that Jim Gray started, which we now have followed up with the WorldWide Telescope, which is a wonderful tool.

JU: It's a brilliant thing. I've actually done two in-depth conversations about it for this series. One with Curtis Wong, and the other with Jonathan Fay.

TH: It does exactly the things we were talking about: it takes lots of distributed data sets and allows you to search, visualize, and do wonderful things.

So that's one example of an E3 project. Catharine van Ingen's project is another, and there are others. There's a project called the Swiss Experiment that's putting sensors all through the Swiss Alps to measure environmental changes.

JU: Before we discuss the other two areas, let me just ask: What is a project? I gather sometimes Microsoft Research puts out an RFP, and somebody like George Hripcsak at Columbia is awarded money to pursue his research. In other cases, though, there isn't necessarily funding, it's more of a collaboration, as with Catharine van Ingen and Dennis Baldocchi.

TH: Yes. In all cases, I want us to focus on genuine partnership with the academics. It has to be win/win on all sides. There are all sorts of ways. RFPs are one. Targeted funding, like we used to do in TCI, maybe sponsoring post-docs. But other things too, like delivering tools, data sets, services.

What can we do for the computer science community? That's another of our themes.

I used to teach in a computer science department, and I assure you my department was not atypical. We taught Linux, Apache, MySQL, PHP, and Java, and the students used a variety of scripting languages -- Perl, Python, and now Ruby on Rails.

To teach computer science principles it's quite clear you don't necessarily need any Microsoft technology. So the question is, how do we engage with academics in the computer science disciplines?

JU: And what are your thoughts?

TH: We have an opportunity. We need to look at what services, what data, what resources we can give them, so we can partner in a way that they feel is beneficial, so they can do research in the way they want to, and we can find out what services they need, and how we can make our tools more valuable.

Microsoft does now have the beginnings of some exciting service offerings. There's Live Mesh, and we have .NET online services coming along... I liked our internal name, CloudDB, better than SQL Server Data Services, SSDS, but...

JU: ...that's how it always goes.

TH: That's the way of it, yes. So that's in beta at the moment, and I hope by the time of the PDC in October we'll have a lot more concrete things to show. What I need to do is see what we can offer the academic community in terms of resources. Can we help them to explore multi-core? Can we get them data sets at scale that we've anonymized, so they can do research they'd otherwise not be able to do?

JU: And Dryad?

TH: Yes. We now have within Microsoft Research some internal resources -- cores -- and I want to make some of that available externally, and put some services around it, such as Dryad or DryadLINQ.

At the Faculty Summit I want to ask the community -- and after all, I came from that community -- how can we partner with you so that we can give you things that you value, and get your feedback?

JU: What is the Faculty Summit, who's been invited, and what do you aim to accomplish there?

TH: It's an annual event in the U.S.; three or four hundred academics come, mainly computer scientists from the U.S., but there's a sprinkling from around the world -- India, China, Latin America. Really it's an opportunity for us to connect.

I've talked about health and wellness, earth/energy/environment, and computer science. Another area of focus is education and scholarly communication. We'll be unveiling plugins for our tools that make them more useful for scientists to do what they want to do.

JU: The NLM add-in for Word is an obvious example. Are there others?

TH: Yes, we'll announce a Creative Commons plug-in. Many people use Word, PowerPoint, and Excel, and are happy to share their documents. We'd like to give them a plug-in that will help them attach Creative Commons licenses to those documents.

We'll also have a research repository. At the university, I was supposed to monitor the output of my faculty -- 200 academics and 500 post-docs and grad students. What we did was insist on keeping a digital copy of not only publications, but also presentations at conferences, research reports, videos, data...

JU: ...especially data. That's a huge new area.

TH: It is in my view, yes. My undergraduates and engineering faculty never went into the library for traditional library purposes. They went there for a cup of coffee, a chat with their friends, a warm place to work, but not as a library.

So what is the role of the library? My view is very much the view promoted by MIT's DSpace project. The role of a research library in a university is to be the guardian of the intellectual output of the university. And that needn't just be research, it can be teaching materials.

So we've used SQL Server, and the Entity Framework -- a bit like the RDF model of Tim Berners-Lee and friends -- to capture some semantic knowledge. So it tells you this is a presentation, Tony Hey gave it, the local organizers were so and so, it was done on this date, and so on.
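The kind of semantic record he describes can be sketched as subject-predicate-object triples, in the spirit of the RDF model mentioned above. This is a minimal illustration only; all identifiers and values here are hypothetical, not the actual repository's schema:

```python
# Hypothetical metadata for one presentation, stored as triples.
triples = [
    ("talk:42", "rdf:type", "Presentation"),
    ("talk:42", "presenter", "Tony Hey"),
    ("talk:42", "organizer", "Local Organizing Committee"),
    ("talk:42", "date", "2008-07-28"),
]

def query(subject, predicate):
    """Return every object matching a (subject, predicate) pair."""
    return [o for s, p, o in triples if s == subject and p == predicate]

print(query("talk:42", "presenter"))  # -> ['Tony Hey']
```

A triple store like this is what lets the repository answer "who gave this, and when?" without a fixed table schema for every kind of artifact.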

JU: There's also the general notion of wrapping services around raw data sets. I've talked with Timo Hannay at Nature about how often, nowadays, somebody winds up publishing a paper as a "fig leaf of analysis" to cover what's really the publication of some data set.

TH: Timo and I absolutely agree on this. Research repositories which contain text and also data are going to be increasingly important.

JU: Although you're not wild about the term "data services", it's actually useful. I was talking with Jonathan Fay about his discovery of all the astronomical data that's online. On the one hand, it was astonishing to find that it was available at all. On the other hand, in the grand tradition of academia, these were gzipped tarballs that you could only use if you had an extreme amount of specialized knowledge and capability.

What you get, with WorldWide Telescope, is a service layer wrapped around all that raw data that makes it available to a vastly wider audience.

TH: Absolutely. Same with Catharine van Ingen's project. This stuff was locked away in files, and nobody knew what was there. By making it available and exposing it in new ways -- you're right, these data services are very important.

And they're the basis of some of our other projects. So for example, Valerie Daggett at the University of Washington does protein folding, but she also does protein unfolding. She regards protein folding ab initio, right from the beginning with just the structure, as too difficult. So she takes the folded structure and unfolds it, and then looks at the possible foldings you can get. She calls this dynameomics. It involves storing detailed simulations, and we've made a database to help her do that.

JU: How would you characterize the nature of the collaboration between Microsoft Research and Valerie Daggett?

So, for example, with Catharine van Ingen and Dennis Baldocchi, it was a really interesting mesh of interests and capabilities. Dennis is a climate scientist who's plugged into a worldwide network of sensors, but he's not an informatician, he's not someone with deep training in how to probe and reshape a body of data. But that's what Catharine brings to the table.

So in this protein-folding collaboration, what's the partnership really about?

TH: It's on two levels. Valerie really is a computational scientist. She does these computationally intensive calculations, and she uses national supercomputers.

One of the things we've done is give them experimental Windows HPC clusters, so instead of doing it remotely they can actually get a lot of calculations done on local machines.

The other part is that they don't have particular expertise in databases. So Stuart Ozer, who used to be in Jim Gray's group and now is back with SQL Server, collaborated with them to set up a data cube.
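At its core, the data cube Stuart Ozer set up is a group-by aggregation over the dimensions of the simulation data. Here is a minimal sketch of that idea; the records and dimension names are hypothetical, not the actual dynameomics schema:

```python
from collections import defaultdict

# Hypothetical simulation frames: one record per stored trajectory frame.
frames = [
    {"protein": "engrailed", "temp_K": 298, "time_ns": 1, "rmsd": 1.2},
    {"protein": "engrailed", "temp_K": 498, "time_ns": 1, "rmsd": 4.8},
    {"protein": "engrailed", "temp_K": 498, "time_ns": 2, "rmsd": 6.1},
    {"protein": "ubiquitin", "temp_K": 298, "time_ns": 1, "rmsd": 0.9},
]

def cube_mean(records, dims, measure):
    """Average `measure` over every combination of values of `dims`."""
    groups = defaultdict(list)
    for r in records:
        groups[tuple(r[d] for d in dims)].append(r[measure])
    return {key: sum(vals) / len(vals) for key, vals in groups.items()}

# Roll up mean RMSD by (protein, temperature) -- one face of the cube.
by_protein_temp = cube_mean(frames, ["protein", "temp_K"], "rmsd")
```

Choosing different `dims` lists gives the other roll-ups of the cube (by protein alone, by temperature alone, and so on), which is what makes the cube useful for exploring a large simulation archive.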

JU: It seems like the transfer of database expertise is a common thread in a lot of these collaborations. Although many of these folks may be computationally oriented scientists who know how to work with algorithms and with code, data management is another kind of discipline, and not one that necessarily comes naturally.

TH: That's right. By the way, we're also active in computational education for scientists. When I did that in the 80s and 90s it was about algorithms and parallelism and things like that. But you're quite right, it's now, in addition to those things, about knowing how to deal with data.

We have projects with two Nobel Prize winners, Phil Sharp at MIT, and Carl Wieman at Vancouver, looking at what you teach biologists and physicists about new skills, in order to produce a new generation of computational scientists who understand the data as well as the computation.

And I'd be remiss if I didn't mention Stephen Emmott. I emphasize the data, but he'd say that the complexity of the modeling that you have to do with this data is as important. And therefore, some of the abstractions from computer science can really help the modeling side of science.

One of our engagements is a joint bioinformatics modeling institute in Trento, and that's an initiative of Stephen Emmott and his team.

JU: I guess that most people know Microsoft has a massive research arm, and there's been a lot said and written about internal technology transfer -- something gets invented in MSR, then it's thrown over the wall into a product group. People have heard that story, but this other story about external collaboration isn't so well known.

TH: That's true, though it does link to our research within MSR. In terms of the computer science and education communities, we have wonderful tools here that actually don't end up in products. One of the things I hope to do is make more of these available.

We now, at Microsoft, have two OSI-approved open source licenses, Ms-PL and Ms-RL. I'd like to make some of our tools -- ones that aren't going into products -- available, so that we can build communities and show what great tools there are: tools that really do the things the computer science and science communities want.

So, I talked about our four themes -- health and wellness, earth/energy/environment, computer science, education and scholarly communication. In addition we have what we call ARTS: Advanced Research Tools and Services. There we're trying to develop tools and services that academics and computer scientists will find valuable.

And there are many others. We just did a count: in total, with RFPs and small projects and big projects, we had, across the whole of Microsoft Research, something like 400 projects with external partners in universities.

My challenge is to focus that a bit more, and make sure we capture and build on the ones that are successful.

JU: Well, very good, Tony. Thanks a lot!

TH: Thanks very much, Jon.


