<?xml version="1.0" encoding="UTF-8" ?>
<?xml-stylesheet type="text/xsl" media="screen" href="/styles/xslt/rss.xslt"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:trackback="http://madskills.com/public/xml/rss/module/trackback/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:media="http://search.yahoo.com/mrss/" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:c9="http://channel9.msdn.com">
<channel>
	<title>Channel 9 - Entries tagged with e-science</title>
    <atom:link rel="self" type="application/rss+xml" href="http://channel9.msdn.com/Tags/e-science/RSS"/>
    <itunes:summary></itunes:summary>
    <itunes:author>Microsoft</itunes:author>
    <itunes:subtitle></itunes:subtitle>
    <image>
      <url>http://mschnlnine.vo.llnwd.net/d1/Dev/App_Themes/C9/images/feedimage.png</url>
      <title>Channel 9 - Entries tagged with e-science</title>
      <link>http://channel9.msdn.com/Tags/e-science</link>
    </image>
    <itunes:image href=""/>
    <itunes:category text="Technology"/>
    <description>Channel 9 keeps you up to date with the latest news and behind the scenes info from Microsoft that developers love to keep up with. From LINQ to Silverlight – Watch videos and hear about all the cool technologies coming and the people behind them.</description>
    <link>http://channel9.msdn.com/Tags/e-science</link>
    <language>en</language>
    <pubDate>Sun, 12 Feb 2012 14:31:27 GMT</pubDate>
    <lastBuildDate>Sun, 12 Feb 2012 14:31:27 GMT</lastBuildDate>
    <generator>Rev9</generator>
    <c9:totalResults>2</c9:totalResults>
    <c9:pageCount>1</c9:pageCount>
    <c9:pageSize>25</c9:pageSize>
  <item>
      <title>Roger Barga on Trident, a workbench for scientific workflow</title>
      <description><![CDATA[
<p>Roger Barga, a principal architect with Microsoft's Technical Computing Initiative, is leading the development of Trident, a &quot;workflow workbench&quot; for science. In its first incarnation, the tool will enable oceanographers to automate the management and analysis
 of vast quantities of data produced by the <a href="http://en.wikipedia.org/wiki/NEPTUNE">
Neptune sensor array</a>. But as Roger explains in this interview, it's not just about oceanography. Every science is becoming data-intensive. Trident's graphical workflow authoring, reusable data transforms, and support for provenance -- the ability to reliably
 track and reproduce all the analytic steps leading to a scientific result -- are being used by astronomers too, and are expected to find their way into many other disciplines as well.
</p>
<br>
<br>
<table width="300">
<tbody>
<tr>
<td><img src="http://mschnlnine.vo.llnwd.net/d1/on10/perspectives/barga/barga.jpg">
<div><strong>Roger Barga</strong> </div>
</td>
</tr>
</tbody>
</table>
<p><strong>JU:</strong> We're here to talk about <a href="http://www.microsoft.com/mscorp/tc/trident.mspx">
Trident</a>, the scientific workflow workbench for oceanography. Give us the 50,000-foot overview, then we'll zoom in.</p>
<p><strong>RB:</strong> Scientists are increasingly dealing with large volumes of data coming from disparate sources. The process used to be manageable. You'd get post-docs to convert the raw data from the instruments into readable formats, and there was a manual
 workflow to process the data into useful data products. </p>
<p><strong>JU:</strong> Those were the good old days. Or maybe not so good.</p>
<p><strong>RB:</strong> Right. Because the time to get from raw data to those useful products was often measured in weeks or months. But now our ability to capture data has outpaced our ability to process and visualize it. And it's rising exponentially with
 the rapid deployment of cheap sensors.</p>
<p>The oceanographic project we're working on, Neptune, is just one example of this. Astronomy, and all other sciences, are experiencing the same trend.</p>
<p><strong>JU:</strong> Neptune is a University of Washington oceanographic project ...</p>
<p><strong>RB:</strong> ... it's actually an NSF project. The proper name is <a href="http://www.joiscience.org/ocean_observing/initiative">
Ocean Observatories Initiative</a>, and it's being funded for several hundred million dollars. The University of Washington is one of the partners. Monterey Bay Aquarium Research Institute and a number of coastal observatories are involved as well.</p>
<p><strong>JU:</strong> So fiber-optic cables are being laid, and lots of oceanographic data will be pouring in.</p>
<p><strong>RB:</strong> Exactly. It's transformed oceanography from a data-poor discipline to a data-rich one. They're going to be able to monitor the oceans 24x7 over long periods of time. So the kinds of processes they can study were never within reach before.
 They could collect data when there was an episodic event, or when they could get funding. Now they'll be collecting permanently.</p>
<p><strong>JU:</strong> What's the scope of the sensor network?</p>
<p><strong>RB:</strong> They're laying the trench in Monterey to test and deploy the sensors. NSF is reviewing the larger program, and getting ready to fund the Neptune array, which will be off the coast of Washington and Oregon. The Canadian version of the
 Neptune array is up and running and collecting data, but the software infrastructure is still being built as we speak.</p>
<p><strong>JU:</strong> What quantities of data is the Canadian array producing?</p>
<p><strong>RB:</strong> Gigabytes per day. It can easily handle a couple of high-def video streams coming from the ocean floor.</p>
<p><strong>JU:</strong> Really?</p>
<p><strong>RB:</strong> Yes. And also in-situ devices that can sequence organisms. It really is like not only taking Internet and power out to the ocean, but also a USB bus that instruments can be plugged into.</p>
<p><strong>JU:</strong> What are some of the experiments that become possible with this setup?</p>
<p><strong>RB:</strong> For example, being able to understand sediment flows across the ocean floor, how temperature and salinity change, how fresh water flows in from rivers, what kind of life exists at those margins. And understanding that interesting narrow
 band where life thrives in the ocean. Too high up and the tides affect it, too low and there's not enough light. But really, there are a myriad of things like that.
</p>
<p><strong>JU:</strong> So an experiment, in this data-intensive new world, involves formulating a hypothesis, looking for patterns in previously-collected data, and then seeing whether data collected in the future supports the hypothesis.
</p>
<p>That means you not only need to run an analysis on data, but that you have to be able to repeat that analysis on an evolving body of data. Hence the need for the workflow automation that you're providing in the workbench.</p>
<p><strong>RB:</strong> Yes. Another aspect is the need to calibrate and tune the models. If they can do that based on long-term monitoring, it'll remove a lot of the uncertainty in our understanding of the oceans. Versus now, where the data are so sparse that
 it's hard to validate the model.</p>
<p><strong>JU:</strong> I guess also that as your understanding of the data and the models evolves, you might want to rethink what data you're capturing and how you're interpreting it. So, what is it that you've built with Trident, and how does it help you
 do those things?</p>
<p><strong>RB:</strong> Jim Gray was the first person who had the vision of an oceanographer's workbench. His insight was that scientists really want to interact with visualizations of the ocean, but there was a huge gap between the raw data and those visualizations.
</p>
<p>Managing information and managing data is one of Microsoft's core strengths. In
<a href="http://research.microsoft.com/erp/">External Research</a>, we look for partnership opportunities where we can bring our technology, learn from applying it to data-intensive stress tests that involve even more data than our commercial products currently
 handle, and figure out how to use or extend our technology to provide a solution.</p>
<p>Jim pointed out that workflow was one of the key missing ingredients. We looked at the in-house tools, and Windows Workflow was the engine of choice...</p>
<p><strong>JU:</strong> ...although it didn't exist at the time Jim floated this idea, right?</p>
<p><strong>RB:</strong> Well, yes, it was around in alpha and beta form internally. Jim knew I was doing some of my research using Windows Workflow. Of course he left the solution up to us, but he accurately identified workflow as being a way that the scientist
 could not only manage the data transformations that were needed, but also create a library of solutions that could be shared and reused.</p>
<p>If you look at how Microsoft works as a company, we build platforms and then we expect ISVs to come in and bridge the gap between the platforms and the user communities. That's the role our group has played. We're looking at the requirements of the scientists,
 we're looking at the platform Microsoft provides, and we're building on that platform to provide a custom solution to the scientists that will not only accelerate their work, but change how they do science -- enable them to ask and answer questions they couldn't
 before.</p>
<p>We partnered initially with the University of Washington and Monterey Bay Aquarium Research Institute, or MBARI. They're already gathering data from sensors, so they could describe the spectrum of data we'd have to ingest into our workflows. The University
 of Washington has a visualization tool called <a href="http://www.cs.washington.edu/homes/keithg/oceans.html">
COVE</a>, which scientists are adopting as the preferred way to look at the ocean floor. You can think of it as Virtual Earth for the ocean. If there's bathymetry data, you can pull it in and see the ocean floor.
</p>
<p><strong>JU:</strong> What kinds of data transformations are needed to get from the sensor outputs to COVE's inputs?</p>
<p><strong>RB:</strong> There are probably about two dozen kinds of data sources we need to be able to ingest, based on the instruments and the types of data they put out. Typically it's streaming data in
<a href="http://www.unidata.ucar.edu/software/netcdf/">NetCDF format</a>, or some other common format. So the first step is to recognize what kind of data format an instrument or model is kicking out, and transform it into an internal structure that our tool
 can use.</p>
<p><strong>JU:</strong> But the workflow engine is abstracted from the instrumentation data formats and from the visualization tools, right? It's a mechanism for reproducibly running transformations, and managing that pipeline.</p>
<p><strong>RB:</strong> Right. But let's start with how we interacted with the scientists. Jim Gray would ask scientists: &quot;What are the top 20 questions you want to ask, and queries you want to run?&quot; From that, he'd get an understanding of how they viewed the
 data, and what kind of processing was required.</p>
<p>We took the same approach, and asked the scientists which top 20 workflows they perform and which top 20 visualizations they like to see. Then we went through them from top to bottom, talking about the transforms and data integration that were required.
 We wound up with a set of two dozen transformations that were common across all of these workflows. That became the library of activities -- reusable chunks of code -- that the scientists could call upon to author not only these 20 workflows, but the next
 20.</p>
<p><strong>JU:</strong> Can you give a couple of examples?</p>
<p><strong>RB:</strong> Sure. Regridding. You have two data sets, one's from a model and the other's from a set of deployed sensors out in the ocean. They're on different grid coordinate systems and you need to be able to bring those two together. That may
 require some interpolation, you might need to drop or add data points, transform coordinates, join data sets.</p>
<p><strong>JU:</strong> There might be a temporal variant of the spatial gridding as well, to align different time scales?
</p>
<p><strong>RB:</strong> Right. Some instruments are getting things every second, some are getting them every 15 minutes. You can ask the user: &quot;Do you want interpolation to take place? Do you want the system to match up the points?&quot; Based on these inputs, the
 correct workflow gets configured and they see the resulting visualization for the region of ocean they're interested in.</p>
<p><strong>JU:</strong> It sounds like some of these primitives will wind up being fairly general, not just specific to oceanography.</p>
<p><strong>RB:</strong> Indeed they are. We're producing a version of Trident for oceanography, but many of these activities could be useful for other sciences as well. People in earth sciences, for example, are also using NetCDF and many of the same operations.</p>
<p>Because we're building a tool that is extensible, and agnostic in terms of the science it supports, you can imagine it being used, for example, to understand the interaction between oceans and warm air currents.
</p>
<p><strong>JU:</strong> What does the Trident user see and do?</p>
<p><strong>RB:</strong> We realized that the authoring experience for scientific workflow is very different from, say, business workflow. In business, you'd have your accountant write your expense report workflow. They'd lock it down, they'd deploy it, everybody
 would use it from then on, and nobody would touch it until it came back for bug fixes or enhancements.
</p>
<p>What we found with scientists is that they want to borrow somebody's workflow that does what they want, or close to it, load that workflow, and then start authoring from that point on.
</p>
<p>So we implemented that in Trident. You can search for workflows by purpose, or by the inputs they process. You click on one and load it into a visual browser, because while the oceanographers understand the workflows, they don't want to see C# or Java; they
 want to see something visual -- boxes that represent the transformations they want to apply.
</p>
<p><strong>JU:</strong> We've mentioned the Windows Workflow Foundation. For folks who aren't familiar with that system, how would you characterize it? How is it like and unlike a script execution engine?</p>
<p><strong>RB:</strong> What's unique about workflow, versus scripting, is that with workflow you tease apart the notion of a schedule, which is the sequence of actions you'd like to have performed, from the code that performs each action. If you were to look inside each of those steps, you'd see
 code similar to what you'd find in a script. But on top of the sequence of steps you have an orchestration engine. When you pass this workflow -- this sequence of steps -- over to the orchestration engine, it runs the code inside each of the boxes, but as
 each one completes, control passes to the orchestration engine. </p>
<p>So we have an abstraction layer, we've opened up the opportunity for reuse, the steps or activities become building blocks. In addition, the orchestration engine can monitor the execution of the workflow, or change the way it executes -- for example, by
 running blocks in parallel on a multicore machine. </p>
<p><strong>JU:</strong> What struck me about the Workflow Foundation was the way in which workflows can be very big or very small. As small as the sequence of interactions with a form on a web page, in which case the orchestration engine can be embedded entirely
 in the code that's behind that web page. </p>
<p>Or it can be a very big thing. But in any case, since it's part of the .NET Framework, it can exist in a variety of places. It can run locally on a laptop, it can run on a server in the cloud. There's an interesting amount of flexibility in terms of how
 workflows can be deployed. An application could embed Trident, or Trident could be used as a service.</p>
<p><strong>RB:</strong> That's right. That's the magic of it. Yes, it could be hosted in an environment that the scientist is already familiar with. Or for a big institution, you could post it up as a service. Anybody could access it from a browser. And that's
 part of our mantra here. If we provide this to the scientists, we have to make sure it works with the tools they're comfortable using. You should be able to point your Linux box running Firefox at this tool.</p>
<p>But to your other point, we're experimenting here with workflows that are resource-seeking. You could launch one, perhaps even on your cellphone, and that scheduling engine's going to look for systems that have resources for that workflow, tap into them,
 and give the user on the cellphone the impression it's running locally. </p>
<p><strong>JU:</strong> You've mentioned that the workflow style encourages a level of modularity that you might not otherwise get. It also provides a level of monitoring, control, and auditing. The reason that's important goes back to the idea of reproducibility.
</p>
<p>A friend of mine is an HPC expert, and one of his pet peeves is that when people look at HPC they tend to focus on how much raw horsepower can be thrown at a problem. His question is: &quot;Who's worrying about reproducibility and correctness?&quot; It's a really
 important question. </p>
<p>In your environment, as I understand it, one of the things that you get is the ability to capture and replay and analyze what happened in a workflow, and the ability to faithfully reproduce a sequence of steps. You talked about enabling things that scientists
 couldn't do before. It's not only that they couldn't analyze large quantities of data, but also that they couldn't automate their own methods, and be able to reflect on them in an automated way.</p>
<p><strong>RB:</strong> Right. Even if we couldn't run a workflow faster, and even if we weren't processing a lot more data, one of our key features is support for provenance.
</p>
<p><strong>JU:</strong> Explain what you mean by provenance.</p>
<p><strong>RB:</strong> Think about it in terms of art. For a given piece of art, we're able to establish through authorities that it's original, where it came from, and who's had their hands on it through its lifetime. Provenance for a workflow result is the
 same thing. Minimally we want to be able to establish trust in a result. If you think about how that happens, it often starts by considering who wrote the workflow. So with Trident you can click on a result and interrogate the history of the workflow: who
 wrote it, who reviewed it, who revised it, when it first entered the system.</p>
<p>We do versioning as well, so you can look at an old result and know that it was created by an old version of the workflow. And then have the ability to run the new version on the old dataset to see if it makes a difference.
</p>
<p>We capture execution provenance so you know exactly how your result was created. We capture provenance on the workflows themselves so you know who created them, and who's touched them.
</p>
<p>You might be thinking about creating a community, where you click on a workflow and can say: &quot;OK, I trust that post-doc.&quot;</p>
<p><strong>JU:</strong> I've been reflecting on what Microsoft brings to the world of science, in yours and in other collaborations that I've been talking to MSR folks about. One is clearly the special competence and expertise in data management and processing.
 Even for computationally-oriented scientists, that data expertise isn't necessarily a core competence.
</p>
<p>Another is the software tradition of version control. Again, that hasn't been a traditional strength of scientists. So this looks like a fruitful partnership on both fronts.
</p>
<p><strong>RB:</strong> Agreed. It would be nice to get <a href="http://perspectives.on10.net/blogs/jonudell/Making-sense-of-C02-data/">
Catharine van Ingen</a>, or perhaps Alex Szalay to chime in on how this is being used for astronomy. Because we're giving drops of this code to our e-science researchers for use in other areas.
</p>
<p><strong>JU:</strong> I'd love to talk with Alex. I had a couple of in-depth conversations about the WorldWide Telescope, one with
<a href="http://blog.jonudell.net/2008/06/23/the-story-of-the-worldwide-telescope/">
Curtis Wong</a> and the other with <a href="http://blog.jonudell.net/2008/07/14/how-the-worldwide-telescope-works/">
Jonathan Fay</a>, and we touched on the work Alex has done. He's using your stuff as well?</p>
<p><strong>RB:</strong> Not him personally, but his project -- <a href="http://pan-starrs.ifa.hawaii.edu/public/">
Pan-STARRS</a> -- is. Catharine van Ingen and Yogesh Simmhan are co-architects of that system along with Alex. And they're bringing workflow to the table. It's becoming the way scientists upload their data into Pan-STARRS and get it back out, and Trident is
 the workflow engine for that.</p>
<p>You've probably also heard about other activities here in External Research. Perhaps the scholarly communications aspect?</p>
<p><strong>JU:</strong> Yep. I've talked to <a href="http://perspectives.on10.net/blogs/jonudell/Word-for-scientific-publishing/">
Pablo Fernicola</a> about the Word add-in for authoring scientific papers in the National Library of Medicine XML format. And recently I got the
<a href="http://blog.jonudell.net/2008/07/31/a-conversation-with-tony-hey-about-microsoft-external-research-and-the-new-breed-of-e-scientists/">
overview of External Research</a> from Tony Hey.</p>
<p><strong>RB:</strong> When you think about Trident in the context of scholarly communication -- and to your point about the importance of provenance, we see eye to eye on that -- not only can we use these tools for e-science data management, but we're focusing
 on reproducible research. When Trident has finished running a workflow, we'll create an XML structure that describes how to call back into Trident to recreate the result. We're really keen on the idea that it becomes easier not only to do the science and publish
 the science, but actually to reproduce it. And that XML description should be able to be embedded in the published work.</p>
<p>That's really exciting. It's been talked about in the computational sciences, but never addressed end to end with a tool that's instrumented, that produces an XML standard the community can own which describes how the science was done, and that gets carried
 along with the publication, either physically or by reference, and we store this execution script in a database somewhere.
</p>
<p><strong>JU:</strong> It's a really big idea.</p>
<p><strong>RB:</strong> It is. I think it could be transformational.</p>
<p><strong>JU:</strong> I do too.</p>
<p><strong>RB:</strong> Right now, reproducibility means that if you happen to know the person who did the experiment, or you happen to capture enough stuff in your lab notebook or on your whiteboard, then you have a chance of being able to do it again. But
 imagine being able to click any result, and automatically and transparently reproduce that result.</p>
<p><strong>JU:</strong> In reality it won't necessarily be the case that you can punch a button and have everything replayed exactly. But having the documentation, at that level of detail, and in that form, would be an incredible asset.</p>
<p><strong>RB:</strong> Agreed. The hope is that here in External Research, because we're building these tools not just in the context of one science project, but many, you can have community tools that bridge communities. We're talking to people in the earth
 sciences doing atmospheric studies, and their workflows and analyses are so similar to what the oceanographers are doing. But right now, since those two communities aren't talking or sharing tools, it's very difficult for one community to interact with the
 other.</p>
<p><strong>JU:</strong> That's a really nice point. Well, thanks Roger!</p>
<p><strong>RB:</strong> See you later.</p>]]></description>
      <comments>http://channel9.msdn.com/Blogs/JonUdell/Roger-Barga-on-Trident-a-workbench-for-scientific-workflow</comments>
      <itunes:summary>
Roger Barga, a principal architect with Microsoft&#39;s Technical Computing Initiative, is leading the development of Trident, a &quot;workflow workbench&quot; for science. In its first incarnation, the tool will enable oceanographers to automate the management and analysis
 of vast quantities of data produced by the 
Neptune sensor array. But as Roger explains in this interview, it&#39;s not just about oceanography. Every science is becoming data-intensive. Trident&#39;s graphical workflow authoring, reusable data transforms, and support for provenance -- the ability to reliably
 track and reproduce all the analytic steps leading to a scientific result -- are being used by astronomers too, and are expected to find their way into many other disciplines as well.
JU: We&#39;re here to talk about Trident, the scientific workflow workbench for oceanography. Give us the 50,000-foot overview, then we&#39;ll zoom in.
RB: Scientists are increasingly dealing with large volumes of data coming from disparate sources. The process used to be manageable. You&#39;d get post-docs to convert the raw data from the instruments into readable formats, and there was a manual
 workflow to process the data into useful data products. 
JU: Those were the good old days. Or maybe not so good.
RB: Right. Because the time to get from raw data to those useful products was often measured in weeks or months. But now our ability to capture data has outpaced our ability to process and visualize it. And it&#39;s rising exponentially with
 the rapid deployment of cheap sensors.
The oceanographic project we&#39;re working on, Neptune, is just one example of this. Astronomy, and all other sciences, are experiencing the same trend.
JU: Neptune is a University of Washington oceanographic project ...
RB: ... it&#39;s actually an NSF project. The proper name is 
Ocean Observatories Initiative, and it&#39;s being funded for several hundred million dollars. The University of Washington is one of the pa</itunes:summary>
      <itunes:duration>1890</itunes:duration>
      <link>http://channel9.msdn.com/Blogs/JonUdell/Roger-Barga-on-Trident-a-workbench-for-scientific-workflow</link>
      <pubDate>Thu, 28 Aug 2008 17:41:00 GMT</pubDate>
      <guid isPermaLink="false">http://channel9.msdn.com/Blogs/JonUdell/Roger-Barga-on-Trident-a-workbench-for-scientific-workflow</guid>
      <media:group>
        <media:content url="http://mschnlnine.vo.llnwd.net/d1/on10/perspectives/barga/barga.mp3" expression="full" duration="1890" fileSize="15136512" type="audio/mp3" medium="audio"/>
        <media:content url="http://mschnlnine.vo.llnwd.net/d1/on10/perspectives/barga/barga.wma" expression="full" duration="1890" fileSize="15312203" type="audio/x-ms-wma" medium="audio"/>
      </media:group>      
      <enclosure url="http://mschnlnine.vo.llnwd.net/d1/on10/perspectives/barga/barga.wma" length="15312203" type="audio/x-ms-wma"/>
      <dc:creator>JonUdell</dc:creator>
      <itunes:author>JonUdell</itunes:author>
      <slash:comments>0</slash:comments>
      <wfw:commentRss>http://channel9.msdn.com/Blogs/JonUdell/Roger-Barga-on-Trident-a-workbench-for-scientific-workflow/RSS</wfw:commentRss>
      <category>e-science</category>
      <category>oceanography</category>
      <category>podcasts</category>
      <category>Workflow</category>
    </item>
  <item>
      <title>How Microsoft&#39;s External Research Division works with a new breed of e-scientists</title>
      <description><![CDATA[
<p>Tony Hey, VP for the External Research Division within Microsoft Research, leads the company's efforts to build external partnerships in key areas of scientific research, education, and computing. He's been a physicist, a computer scientist, and dean of
 engineering, and for five years ran the UK's e-Science program. These experiences have given him a broad view of the ways in which all the sciences are becoming both computational and data-intensive. Microsoft tools and services, he says, will support and
 sustain the new breed of scientists riding this new wave. </p>
<p>Audio: <a href="http://mschnlnine.vo.llnwd.net/d1/on10/perspectives/hey/hey.wma">
WMA</a>, <a href="http://mschnlnine.vo.llnwd.net/d1/on10/perspectives/hey/hey.mp3">
MP3</a> </p>
<br>
<br>
<table width="300">
<tbody>
<tr>
<td><img alt="" src="http://mschnlnine.vo.llnwd.net/d1/on10/perspectives/hey/hey.jpg">
<div><b>Tony Hey</b> </div>
</td>
</tr>
</tbody>
</table>
<p><b>JU</b>: For this series of interviews I've spoken to a number of Microsoft folks who are working with external academic partners on projects that fall under your purview. The list includes Pablo Fernicola's
<a href="http://perspectives.on10.net/blogs/jonudell/Word-for-scientific-publishing/">
Word add-in for scientific publishing</a>, Catharine van Ingen's collaboration with Dennis Baldocchi at Berkeley on the
<a href="http://perspectives.on10.net/blogs/jonudell/Making-sense-of-C02-data/">analysis of CO2 data</a>, and Kyril Faenov's HPC++ project to bring
<a href="http://perspectives.on10.net/blogs/jonudell/Cluster-computing-for-the-classroom/">
cluster computing to the classroom</a>. These are all pieces of your puzzle, right?</p>
<p><b>TH</b>: Absolutely.</p>
<p><b>JU</b>: By way of background, you've been a physicist, then a computer scientist, and then for a time led the UK's e-science program.</p>
<p><b>TH</b>: Which would be called cyberinfrastructure in the US, yes. I'm on the NSF's advisory committee for cyberinfrastructure; it's a very similar goal.</p>
<p><b>JU</b>: And then you surprised a lot of people by joining Microsoft. Take us through your initial role leading the TCI [technical computing initiative] and on to your current expanded role leading MSR's external research efforts.</p>
<p><b>TH</b>: Right. So having been a physicist, and then a computer scientist working on parallel computing for years, and then chair of my computer science department and then dean of engineering, I think I understand the community we're trying to work with
 pretty well.</p>
<p>Also, as you mentioned, I worked for 5 years running the UK e-science program. That was about huge amounts of distributed data, and collaborative multi-disciplinary research in a variety of fields. The environment, bioinformatics, almost every field of science
 now has some element of distributed and networked collaboration.</p>
<p>The science agenda was for the tools and technologies to make that collaboration trivial, just as with Web 2.0 your grandmother can do a mashup.</p>
<p>I don't think the UK e-science program achieved that, but I do believe that Microsoft can help make tools and technologies available that will help scientists and researchers do their work.</p>
<p><b>JU</b>: In your parallel computing phase, you helped write the MPI [message passing interface] specification, correct?</p>
<p><b>TH</b>: Yes. I've been in this for 30 years, on and off. I have very good friends in the high-performance and parallel computing communities here in the US, and I was involved in European projects. There was a danger that the Europeans would go one way,
 and the US another, so it was time to see if we could get the community to put together a community standard.
</p>
<p>It isn't an ISO standard, there wasn't a big standards body, it was a group of experts who got together with the academics and with the industry players. Rather a small set, and we used to meet every 6 weeks in Dallas airport, so you really had to be dedicated
 to go there.</p>
<p><b>JU</b>: [laughs]</p>
<p><b>TH</b>: But what came out of it was a standard which has stood the test of time. I co-authored and initiated the first draft. It's been much changed since then, and I don't take credit for the final thing, but I did try, with Jack Dongarra, to initiate
 the standards process, and I think I remember buying the beer at the first session.</p>
<p><b>JU</b>: What's interesting to me is that despite that, you've been a vocal skeptic regarding raw grid capability. And you've been very careful to stress that in your view, the real challenges have to do with data -- the ability to combine large quantities
 of data from multiple sources, and enable people to make sense of it.</p>
<p><b>TH</b>: Yes. I used to work in high-end supercomputing and parallel computing, but what distinguishes this decade is that we'll collect more scientific data than we have collected in the whole of human history. Instead of struggling with the problem of
 too little data, scientists will be struggling with the problem of huge amounts that they can't process or analyze. And it may be stored in different places, on different continents, so how do you put it together? How do you federate?</p>
<p>That's the real challenge. Very few people want to use petaflop computers. Most of the biologists, chemists, and engineers only need lesser capabilities that can be provided by just a simple cluster. And then you put the cluster where the data is, because that's
 what's difficult to move around. </p>
<p><b>JU</b>: Yes, Kyril Faenov made this same point in my interview with him. There are only a handful of intergalactic cloud infrastructures of the sort that a Google or Amazon or Microsoft can support, they're one-of-a-kind beasts, and you can't always bring
 your data to them. So he's interested in enabling organizations to stand up their own more modest clusters at the sites where the data lives.
</p>
<p>So, let's discuss the opportunity that you see. In another interview you said:
</p>
<blockquote>Rather than wasting the enthusiasm and talents of science graduate students by assigning them the task of building systems capable of handling, analyzing and mining literally petabytes of data, scientists should look to computer scientists and the
 IT companies to raise the level of abstraction and to provide them with the components of a reliable and functional cyberinfrastructure.
</blockquote>
<p>&nbsp;</p>
<p>That's the most concise mission statement I've found for what you're doing.</p>
<p><b>TH</b>: Exactly right. Part of my reason for joining Microsoft was having had a great friendship, and many discussions and arguments, with Jim Gray, from 2001 onwards.
</p>
<p>We argued and disagreed on many things, but we also agreed on things, and what we agreed on in particular is that a different paradigm is emerging. So for example there's experimental physics, there's theoretical physics, and now the third paradigm, it's
 clear, is computational physics based on simulation. </p>
<p>What we're looking at here is data-centric science, where you'll do collections-based research -- like you do in mashups, but now with scientific datasets. And increasingly, you'll use semantics to get from data to information to real knowledge.
</p>
<p>So I came to Microsoft partly because of Jim Gray, but partly because I think companies can help. I struggled mightily with just open source tools. I used to produce open source tools myself, as an academic. MPI has a wonderful open source implementation,
 and that was one of the key things that we did.</p>
<p>But I also know that open source, particularly when produced by academics like myself, well, it works on my machine, but if you want it to work on your machine, that's your problem.
</p>
<p>So one of the things I set up in the UK was, in fact, a software engineering center called the
<a href="http://www.omii.ac.uk/">Open Middleware Infrastructure Institute</a>, where I put a lot of money in to get these open source codes tested and documented and made more reliable and sharable.</p>
<p>That's why I think that a judicious mix of open source with commercial -- it could be from IBM, from Oracle, from Microsoft -- is the way to provide a more reliable infrastructure.</p>
<p>That's part of the motivation for the tools we're producing around the technologies that scientists use to do their publication, their data mining, and so on. I think Microsoft can really take a lead here, and that's why I joined.</p>
<p><b>JU</b>: Elsewhere you've said: </p>
<blockquote>Essentially I match up Microsoft researchers with major scientific problems that computer science technology can help to solve.
</blockquote>
<p>&nbsp;</p>
<p>What are those major problems?</p>
<p><b>TH</b>: So, I came with a purely scientific mission with TCI. But now I've moved into Microsoft Research, and we have a bigger agenda. In terms of external research, we focus on four themes.
</p>
<p>One is health and wellness. That's bioinformatics, medical solutions, and so on. Really exciting, we've got some good projects in that area.</p>
<p><b>JU</b>: I've talked to Kris Tolle and have done an <a href="http://perspectives.on10.net/blogs/jonudell/Making-sense-of-electronic-health-records/">
interview with George Hripcsak</a>, who's one of the recipients of funding in the <a href="http://www.microsoft.com/presspass/press/2008/apr08/04-17GWASPR.mspx">
genome-wide association studies program</a>. </p>
<p><b>TH</b>: Kris is great, she and Simon Mercer are looking at the biomedical area, and they've got a wonderful set of projects ranging from high-tech stuff involving RNA and HIV/AIDS down to the last mile of preventative health care, looking at ways in Latin
 America to take a smartphone and connect it to a low-cost diagnostic tool, like a blood-pressure monitor, and therefore do health care in these remote places.</p>
<p>The next major area is what we call E3 -- earth, energy, and the environment. That includes the astronomy work that Jim Gray started, which we now have followed up with the WorldWide Telescope, which is a wonderful tool.</p>
<p><b>JU</b>: It's a brilliant thing. I've actually done two in-depth conversations about it for this series. One with
<a href="http://perspectives.on10.net/blogs/jonudell/The-story-of-the-WorldWide-Telescope/">
Curtis Wong</a>, and the other with <a href="http://perspectives.on10.net/blogs/jonudell/How-the-WorldWide-Telescope-works/">
Jonathan Fay</a>.</p>
<p><b>TH</b>: It does exactly the things we were talking about: it takes lots of distributed data sets and allows you to search and visualize and do wonderful things.
</p>
<p>So that's one example of an E3 project. Catharine van Ingen's project is another, and there are others. There's a project called the
<a href="http://www.swiss-experiment.ch/index.php/Category:About">Swiss Experiment</a> that's putting sensors all through the Swiss Alps to measure environmental changes.</p>
<p><b>JU</b>: Before we discuss the other two areas, let me just ask: What is a project? I gather sometimes Microsoft Research puts out an RFP, and somebody like George Hripcsak at Columbia is awarded money to pursue his research. In other cases, though, there
 isn't necessarily funding, it's more of a collaboration, as with Catharine van Ingen and Dennis Baldocchi.</p>
<p><b>TH</b>: Yes. In all cases, I want us to focus on genuine partnership with the academics. It has to be win/win on all sides. There are all sorts of ways. RFPs are one. Targeted funding, like we used to do in TCI, maybe sponsoring post-docs. But other things
 too, like delivering tools, data sets, services. </p>
<p>What can we do for the computer science community? That's another of our themes.</p>
<p>I used to teach in a computer science department, and I assure you my department was not atypical. We taught Linux, Apache, MySQL, PHP, Java, and they used a variety of scripting languages -- Perl, Python, and now Ruby on Rails.
</p>
<p>To teach computer science principles it's quite clear you don't necessarily need any Microsoft technology. So the question is, how do we engage with academics in the computer science disciplines?</p>
<p><b>JU</b>: And what are your thoughts?</p>
<p><b>TH</b>: We have an opportunity. We need to look at what services, what data, what resources we can give them, so we can partner in a way that they feel is beneficial, so they can do research in the way they want to, and we can find out what services they
 need, and how we can make our tools more valuable.</p>
<p>Microsoft does now have the beginnings of some exciting service offerings. There's Live Mesh, and we have .NET online services coming along...I liked our internal name, CloudDB, better than SQL Server Data Services, SSDS, but...</p>
<p><b>JU</b>: ...that's how it always goes.</p>
<p><b>TH</b>: That's the way of it, yes. So that's in beta at the moment, and I hope by the time of the PDC in October we'll have a lot more concrete things to show. What I need to do is see what we can offer the academic community in terms of resources. Can
 we help them to explore multi-core? Can we get them data sets at scale that we've anonymized, so they can do research they'd otherwise not be able to do?</p>
<p><b>JU</b>: And <a href="http://research.microsoft.com/research/sv/Dryad/">Dryad</a>?</p>
<p><b>TH</b>: Yes. We now have within Microsoft Research some internal resources -- cores -- and I want to make some of that available externally, and put some services around it, such as Dryad or Dryad/LINQ.</p>
<p>At the <a href="http://research.microsoft.com/workshops/fs2008/">Faculty Summit</a> I want to ask the community -- and after all, I came from that community -- how can we partner with you so that we can give you things that you value, and get your feedback?</p>
<p><b>JU</b>: What is the Faculty Summit, who's been invited, and what do you aim to accomplish there?</p>
<p><b>TH</b>: It's an annual event in the U.S., three or four hundred academics come, mainly computer scientists from the U.S. but there's a sprinkling from around the world -- India, China, Latin America. Really it's an opportunity for us to connect.
</p>
<p>I've talked about health and wellness, earth/energy/environment, and computer science. Another area of focus is education and scholarly communication. We'll be unveiling plugins for our tools that make them more useful for scientists to do what they want
 to do.</p>
<p><b>JU</b>: The <a href="http://www.microsoft.com/downloads/details.aspx?FamilyID=09C55527-0759-4D6D-AE02-51E90131997E&amp;displaylang=en">
NLM add-in for Word</a> is an obvious example. Are there others?</p>
<p><b>TH</b>: Yes, we'll announce a Creative Commons plug-in. Many people use Word, PowerPoint, and Excel, and are happy to share their documents. We'd like to give them a plug-in that will help them attach Creative Commons licenses to those documents.</p>
<p>We'll also have a research repository. At the university, I was supposed to monitor the output of my faculty -- 200 academics and 500 post-docs and grad students. What we did was insist on keeping a digital copy of not only publications, but also presentations
 at conferences, research reports, videos, data...</p>
<p><b>JU</b>: ...especially data. That's a huge new area.</p>
<p><b>TH</b>: It is in my view, yes. My undergraduates and engineering faculty never went into the library for traditional library purposes. They went there for a cup of coffee, a chat with their friends, a warm place to work, but not as a library.</p>
<p>So what is the role of the library? My view is very much the MIT DSpace view that's been promoted. The role of a research library in a university is to be the guardian of the intellectual output of the university. And that needn't just be research, it can
 be teaching materials.</p>
<p>So we've used SQL Server, and the Entity Framework -- a bit like the RDF model of Tim Berners-Lee and friends -- to capture some semantic knowledge. So it tells you this is a presentation, Tony Hey gave it, the local organizers were so and so, it was done
 on this date, and so on. </p>
<p><b>JU</b>: There's also the general notion of wrapping services around raw data sets. I've
<a href="http://blog.jonudell.net/2007/07/06/a-conversation-with-timo-hannay-about-the-scientific-web/">
talked with Timo Hannay</a> at Nature about how often, nowadays, somebody winds up publishing a paper as a &quot;fig leaf of analysis&quot; to cover what's really the publication of some data set.
</p>
<p><b>TH</b>: Timo and I absolutely agree on this. Research repositories which contain text and also data are going to be increasingly important.</p>
<p><b>JU</b>: Although you're not wild about the term &quot;data services&quot;, it's actually useful. I was talking with Jonathan Fay about his discovery of all the astronomical data that's online. On the one hand, it was astonishing to find that it was available at
 all. On the other hand, in the grand tradition of academia, these were gzipped tarballs that you could only use if you had an extreme amount of specialized knowledge and capability.</p>
<p>What you get, with WorldWide Telescope, is a service layer wrapped around all that raw data that makes it available to a vastly wider audience.</p>
<p><b>TH</b>: Absolutely. Same with Catharine van Ingen's project. This stuff was locked away in files, and nobody knew what was there. By making it available and exposing it in new ways...you're right, these data services are very important.</p>
<p>And they're the basis of some of our other projects. So for example, Valerie Daggett at the University of Washington does protein folding, but she also does protein unfolding. She regards protein folding ab initio, right from the beginning with just the
 structure, as too difficult. So she takes the folded structure and unfolds it, and then looks at the possible foldings you can get. She calls this
<a href="http://peds.oxfordjournals.org/cgi/content/abstract/21/6/353">dynameomics</a>. It involves storing detailed simulations, and we've made a database to help her do that.
</p>
<p><b>JU</b>: How would you characterize the nature of the collaboration between Microsoft Research and Valerie Daggett?
</p>
<p>So, for example, with Catharine van Ingen and Dennis Baldocchi, it was a really interesting mesh of interests and capabilities. Dennis is a climate scientist who's plugged into a worldwide network of sensors, but he's not an informatician, he's not someone
 with deep training in how to probe and reshape a body of data. But that's what Catharine brings to the table.</p>
<p>So in this protein-folding collaboration, what's the partnership really about?
</p>
<p><b>TH</b>: It's on two levels. Valerie really is a computational scientist. She does these computationally-intensive calculations, and she uses national supercomputers.
</p>
<p>One of the things we've done is give them experimental Windows HPC clusters, so instead of doing it remotely they can actually get a lot of calculations done on local machines.
</p>
<p>The other part is that they don't have particular expertise in databases. So <a href="http://research.microsoft.com/~stuarto/">
Stuart Ozer</a>, who used to be in Jim Gray's group and now is back with SQL Server, collaborated with them to set up a data cube.</p>
<p><b>JU</b>: It seems like the transfer of database expertise is a common thread in a lot of these collaborations. Although many of these folks may be computationally-oriented scientists, and may know how to work with algorithms and with code, the data management
 is another kind of discipline, and not one that necessarily comes naturally.</p>
<p><b>TH</b>: That's right. By the way, we're also active in computational education for scientists. When I did that in the 80s and 90s it was about algorithms and parallelism and things like that. But you're quite right, it's now, in addition to those things,
 about knowing how to deal with data. </p>
<p>We have projects with two Nobel Prize winners, <a href="http://www.mit.edu/~biology/facultyareas/facresearch/sharp.html">
Phil Sharp</a> at MIT, and <a href="http://www.scientificblogging.com/cwieman">Carl Wieman</a> at Vancouver, looking at what you teach biologists and physicists about new skills, in order to produce a new generation of computational scientists who understand
 the data as well as the computation.</p>
<p>And I'd be remiss if I didn't mention <a href="http://research.microsoft.com/aboutmsr/presskit/semmott/">
Stephen Emmott</a>. I emphasize the data, but he'd say that the complexity of the modeling that you have to do with this data is as important. And therefore, some of the abstractions from computer science can really help the modeling side of science.</p>
<p>One of our engagements is a joint bioinformatics modeling institute in Trento, and that's an initiative of Stephen Emmott and his team.
</p>
<p><b>JU</b>: I guess that most people know Microsoft has a massive research arm, and there's been a lot said and written about internal technology transfer -- something gets invented in MSR, then it's thrown over the wall into a product group. People have
 heard that story, but this other story about external collaboration isn't so well known.</p>
<p><b>TH</b>: That's true, though it does link to our research within MSR. In terms of the computer science and education communities, we have wonderful tools here that actually don't end up in products. One of the things I hope to do is make more of these
 available. </p>
<p>We now, at Microsoft, have two OSI-approved open source licenses, <a href="http://www.microsoft.com/resources/sharedsource/licensingbasics/publiclicense.mspx">
Ms-PL</a> and <a href="http://www.microsoft.com/resources/sharedsource/licensingbasics/reciprocallicense.mspx">
Ms-RL</a>. I'd like to make some of our tools, which aren't going into products, available so that we can build communities and show what great tools there are. Tools that really do things the computer science community and science community want.</p>
<p>So, I talked about our four themes -- health and wellness, earth/energy/environment, computer science, education and scholarly communication. In addition we have what we call ARTS: Advanced Research Tools and Services. There we're trying to develop tools
 and services that academics and computer scientists will find valuable.</p>
<p>And there are many others. We just did a count, and in total, with RFPs and small projects and big projects, we had, over the whole of Microsoft Research, something like 400 projects with external partners in universities.
</p>
<p>My challenge is to focus that a bit more, and make sure we capture and build on the ones that are successful.</p>
<p><b>JU</b>: Well, very good, Tony. Thanks a lot!</p>
<p><b>TH</b>: Thanks very much, Jon.</p>
]]></description>
      <comments>http://channel9.msdn.com/Blogs/JonUdell/How-Microsofts-External-Research-Division-works-with-a-new-breed-of-e-scientists</comments>
      <itunes:summary>
Tony Hey, VP for the External Research Division within Microsoft Research, leads the company&#39;s efforts to build external partnerships in key areas of scientific research, education, and computing. He&#39;s been a physicist, a computer scientist, and dean of
 engineering, and for five years ran the UK&#39;s e-Science program. These experiences have given him a broad view of the ways in which all the sciences are becoming both computational and data-intensive. Microsoft tools and services, he says, will support and
 sustain the new breed of scientists riding this new wave. 
JU: For this series of interviews I&#39;ve spoken to a number of Microsoft folks who are working with external academic partners on projects that fall under your purview. The list includes Pablo Fernicola&#39;s Word add-in for scientific publishing, Catharine van Ingen&#39;s collaboration with Dennis Baldocchi at Berkeley on the analysis of CO2 data, and Kyril Faenov&#39;s HPC++ project to bring cluster computing to the classroom. These are all pieces of your puzzle, right?
TH: Absolutely.
JU: By way of background, you&#39;ve been a physicist, then a computer scientist, and then for a time led the UK&#39;s e-science program.
TH: Which would be called cyberinfrastructure in the US, yes. I&#39;m on the NSF&#39;s advisory committee for cyberinfrastructure, it&#39;s a very similar goal.
JU: And then you surprised a lot of people by joining Microsoft. Take us through your initial role leading the TCI [technical computing initiative] and on to your current expanded role leading MSR&#39;s external research efforts.
TH: Right. So having been a physicist, and then a computer scientist working on parallel computing for years, and then chair of my computer science department and then dean of engineering, I think I understand the community we&#39;re trying to work with
 pretty well.
Also, as you mentioned, I worked for 5 years running the UK e-science program. That was about huge am</itunes:summary>
      <itunes:duration>1800</itunes:duration>
      <link>http://channel9.msdn.com/Blogs/JonUdell/How-Microsofts-External-Research-Division-works-with-a-new-breed-of-e-scientists</link>
      <pubDate>Thu, 31 Jul 2008 16:49:00 GMT</pubDate>
      <guid isPermaLink="false">http://channel9.msdn.com/Blogs/JonUdell/How-Microsofts-External-Research-Division-works-with-a-new-breed-of-e-scientists</guid>
      <media:group>
        <media:content url="http://mschnlnine.vo.llnwd.net/d1/on10/perspectives/hey/hey.mp3" expression="full" duration="1800" fileSize="14223360" type="audio/mp3" medium="audio"/>
        <media:content url="http://mschnlnine.vo.llnwd.net/d1/on10/perspectives/hey/hey.wma" expression="full" duration="1800" fileSize="14389717" type="audio/x-ms-wma" medium="audio"/>
      </media:group>      
      <enclosure url="http://mschnlnine.vo.llnwd.net/d1/on10/perspectives/hey/hey.wma" length="14389717" type="audio/x-ms-wma"/>
      <dc:creator>JonUdell</dc:creator>
      <itunes:author>JonUdell</itunes:author>
      <slash:comments>0</slash:comments>
      <wfw:commentRss>http://channel9.msdn.com/Blogs/JonUdell/How-Microsofts-External-Research-Division-works-with-a-new-breed-of-e-scientists/RSS</wfw:commentRss>
      <category>e-science</category>
      <category>Microsoft Research</category>
      <category>podcasts</category>
      <category>tony hey</category>
    </item>    
</channel>
</rss>
