Roger Barga on Trident, a workbench for scientific workflow
- Posted: Aug 28, 2008 at 10:41AM
Roger Barga, a principal architect with Microsoft's Technical Computing Initiative, is leading the development of Trident, a "workflow workbench" for science. In its first incarnation, the tool will enable oceanographers to automate the management and analysis of vast quantities of data produced by the Neptune sensor array. But as Roger explains in this interview, it's not just about oceanography. Every science is becoming data-intensive. Trident's graphical workflow authoring, reusable data transforms, and support for provenance -- the ability to reliably track and reproduce all the analytic steps leading to a scientific result -- are being used by astronomers too, and are expected to find their way into many other disciplines as well.
JU: We're here to talk about Trident, the scientific workflow workbench for oceanography. Give us the 50,000-foot overview, then we'll zoom in.
RB: Scientists are increasingly dealing with large volumes of data coming from disparate sources. The process used to be manageable. You'd get post-docs to convert the raw data from the instruments into readable formats, and there was a manual workflow to process the data into useful data products.
JU: Those were the good old days. Or maybe not so good.
RB: Right. Because the time to get from raw data to those useful products was often measured in weeks or months. But now our ability to capture data has outpaced our ability to process and visualize it. And it's rising exponentially with the rapid deployment of cheap sensors.
The oceanographic project we're working on, Neptune, is just one example of this. Astronomy, and all other sciences, are experiencing the same trend.
JU: Neptune is a University of Washington oceanographic project ...
RB: ... it's actually an NSF project. The proper name is Ocean Observatories Initiative, and it's being funded for several hundred million dollars. The University of Washington is one of the partners. Monterey Bay Aquarium Research Institute and a number of coastal observatories are involved as well.
JU: So fiberoptic cables are being laid, and lots of oceanographic data will be pouring in.
RB: Exactly. It's transformed oceanography from a data-poor discipline to a data-rich one. They're going to be able to monitor the oceans 24x7 over long periods of time. So the kinds of processes they can study were never within reach before. They could collect data when there was an episodic event, or when they could get funding. Now they'll be collecting permanently.
JU: What's the scope of the sensor network?
RB: They're laying the trench in Monterey to test and deploy the sensors. NSF is reviewing the larger program, and getting ready to fund the Neptune array which will be off the coast of Washington and Oregon. The Canadian version of the Neptune array is up and running and collecting data, but the software infrastructure is still being built as we speak.
JU: What quantities of data is the Canadian array producing?
RB: Gigabytes per day. It can easily handle a couple of high-def video streams coming from the ocean floor.
RB: Yes. And also in-situ devices that can sequence organisms. It really is like not only taking Internet and power out to the ocean, but also a USB bus that instruments can be plugged into.
JU: What are some of the experiments that become possible with this setup?
RB: For example, being able to understand sediment flows across the ocean floor, how temperature and salinity change, how fresh water flows in from rivers, what kind of life exists at those margins. And understanding that interesting narrow band where life thrives in the ocean. Too high up and the tides affect it, too low and there's not enough light. But really, there are a myriad of things like that.
JU: So an experiment, in this data-intensive new world, involves formulating a hypothesis, looking for patterns in previously-collected data, and then seeing whether data collected in the future supports the hypothesis.
That means you not only need to run an analysis on data, but that you have to be able to repeat that analysis on an evolving body of data. Hence the need for the workflow automation that you're providing in the workbench.
RB: Yes. Another aspect is the need to calibrate and tune the models. If they can do that based on long-term monitoring, it'll remove a lot of the uncertainty in our understanding of the oceans. Versus now, where the data are so sparse that it's hard to validate the model.
JU: I guess also that as your understanding of the data and the models evolves, you might want to rethink what data you're capturing and how you're interpreting it. So, what is it that you've built with Trident, and how does it help you do those things?
RB: Jim Gray was the first person who had the vision of an oceanographer's workbench. His insight was that scientists really want to interact with visualizations of the ocean, but there was a huge gap between the raw data and those visualizations.
Managing information and managing data is one of Microsoft's core strengths. In External Research, we look for partnership opportunities where we can bring our technology, learn from applying it to data-intensive stress tests that involve even more data than our commercial products currently handle, and figure out how to use or extend our technology to provide a solution.
Jim pointed out that workflow was one of the key missing ingredients. We looked at the in-house tools, and Windows Workflow was the engine of choice...
JU: ...although it didn't exist at the time Jim floated this idea, right?
RB: Well, yes, it was around in alpha and beta form internally. Jim knew I was doing some of my research using Windows Workflow. Of course he left the solution up to us, but he accurately identified workflow as being a way that the scientist could not only manage the data transformations that were needed, but also create a library of solutions that could be shared and reused.
If you look at how Microsoft works as a company, we build platforms and then we expect ISVs to come in and bridge the gap between the platforms and the user communities. That's the role our group has played. We're looking at the requirements of the scientists, we're looking at the platform Microsoft provides, and we're building on that platform to provide a custom solution to the scientists that will not only accelerate their work, but change how they do science -- enable them to ask and answer questions they couldn't before.
We partnered initially with the University of Washington and Monterey Bay Aquarium Research Institute, or MBARI. They're already gathering data from sensors, so they could describe the spectrum of data we'd have to ingest into our workflows. The University of Washington has a visualization tool called COVE, which scientists are adopting as the preferred way to look at the ocean floor. You can think of it as Virtual Earth for the ocean. If there's bathymetry data, you can pull it in and see the ocean floor.
JU: What kinds of data transformations are needed to get from the sensor outputs to COVE's inputs?
RB: There are probably about two dozen kinds of data sources we need to be able to ingest, based on the instruments and the types of data they put out. Typically it's streaming data in NetCDF format, or some other common format. So the first step is to recognize what kind of data format an instrument or model is kicking out, and transform it into an internal structure that our tool can use.
JU: But the workflow engine is abstracted from the instrumentation data formats and from the visualization tools, right? It's a mechanism for reproducibly running transformations, and managing that pipeline.
RB: Right. But let's start with how we interacted with the scientists. Jim Gray would ask scientists: "What are the top 20 questions you want to ask, and queries you want to run?" From that, he'd get an understanding of how they viewed the data, and what kind of processing was required.
We took the same approach, and asked the scientists which top 20 workflows they perform and which top 20 visualizations they like to see. Then we went through them from top to bottom, talking about the transforms and data integration that were required. We wound up with a set of two dozen transformations that were common across all of these workflows. That became the library of activities -- reusable chunks of code -- that the scientists could call upon to author not only these 20 workflows, but the next 20.
JU: Can you give a couple of examples?
RB: Sure. Regridding. You have two data sets, one's from a model and the other's from a set of deployed sensors out in the ocean. They're on different grid coordinate systems and you need to be able to bring those two together. That may require some interpolation, you might need to drop or add data points, transform coordinates, join data sets.
JU: There might be a temporal variant of the spatial gridding as well, to align different time scales?
RB: Right. Some instruments are getting things every second, some are getting them every 15 minutes. You can ask the user: "Do you want interpolation to take place? Do you want the system to match up the points?" Based on these inputs, the correct workflow gets configured and they see the resulting visualization for the region of ocean they're interested in.
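The time-alignment step Roger describes can be illustrated with a minimal Python sketch. It assumes plain linear interpolation and made-up readings; the function name `resample_linear` and the sample values are hypothetical and are not taken from Trident.

```python
from bisect import bisect_left


def resample_linear(times, values, target_times):
    """Linearly interpolate (times, values) onto target_times.

    `times` must be sorted ascending. Targets outside the sampled range
    are returned as None rather than extrapolated.
    """
    out = []
    for t in target_times:
        if t < times[0] or t > times[-1]:
            out.append(None)
            continue
        i = bisect_left(times, t)
        if times[i] == t:
            # Exact match: no interpolation needed.
            out.append(values[i])
            continue
        t0, t1 = times[i - 1], times[i]
        v0, v1 = values[i - 1], values[i]
        out.append(v0 + (v1 - v0) * (t - t0) / (t1 - t0))
    return out


# Hypothetical instrument: one temperature reading every 15 minutes (900 s).
slow_times = [0, 900, 1800]
slow_temps = [10.0, 13.0, 12.0]

# Align it onto a 5-minute grid used by a second, faster instrument.
fast_times = [0, 300, 600, 900, 1200, 1500, 1800]
aligned = resample_linear(slow_times, slow_temps, fast_times)
```

Returning None outside the sampled range is one possible answer to the "do you want interpolation to take place?" question: the sketch fills gaps between readings but does not invent values before the first or after the last one.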
JU: It sounds like some of these primitives will wind up being fairly general, not just specific to oceanography.
RB: Indeed they are. We're producing a version of Trident for oceanography, but many of these activities could be useful for other sciences as well. People in earth sciences, for example, are also using NetCDF and many of the same operations.
Because we're building a tool that is extensible, and agnostic in terms of the science it supports, you can imagine it being used, for example, to understand the interaction between oceans and warm air currents.
JU: What does the Trident user see and do?
RB: We realized that the authoring experience for scientific workflow is very different from, say, business workflow. In business, you'd have your accountant write your expense report workflow. They'd lock it down, they'd deploy it, everybody would use it from then on, and nobody would touch it until it came back for bug fixes or enhancements.
What we found with scientists is that they want to borrow somebody's workflow that does what they want, or close to it, load that workflow, and then start authoring from that point on.
So we implemented that in Trident. You can search for workflows by purpose, or by the inputs they process. You click on one, and load it into a visual browser because while the oceanographers understand the workflows, they don't want to see C# or Java, they want to see something visual -- boxes that represent the transformations they want to apply.
JU: We've mentioned the Windows Workflow Foundation. For folks who aren't familiar with that system, how would you characterize it? How is it like and unlike a script execution engine?
RB: What's unique about workflow, versus scripting, is that with workflow you tease apart the notion of a schedule, which is the sequence of actions you'd like to have performed, from the code that performs each action. If you were to look inside of each of those steps, you'd see code similar to what you'd find in a script. But on top of the sequence of steps you have an orchestration engine. When you pass this workflow -- this sequence of steps -- over to the orchestration engine, it runs the code inside each of the boxes, but as each one completes, control passes back to the orchestration engine.
So we have an abstraction layer, we've opened up the opportunity for reuse, the steps or activities become building blocks. In addition, the orchestration engine can monitor the execution of the workflow, or change the way it executes -- for example, by running blocks in parallel on a multicore machine.
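The split between a schedule and an orchestration engine can be sketched in a few lines of Python. This is a toy illustration only, not the Windows Workflow Foundation API: `run_workflow`, the three activities, and the sample readings are all hypothetical names and values.

```python
import time


def run_workflow(steps, data):
    """Toy orchestration loop: run each activity in order.

    Control returns to this loop after every step, which is where an
    engine can monitor, log, or reschedule the work.
    """
    log = []
    for name, activity in steps:
        started = time.perf_counter()
        data = activity(data)
        log.append({"step": name, "seconds": time.perf_counter() - started})
    return data, log


# Activities: small reusable blocks that know nothing about the schedule.
def drop_missing(readings):
    return [r for r in readings if r is not None]


def to_kelvin(readings):
    return [r + 273.15 for r in readings]


def mean(readings):
    return sum(readings) / len(readings)


# The schedule is plain data: an ordered list of named activities.
schedule = [
    ("drop_missing", drop_missing),
    ("to_kelvin", to_kelvin),
    ("mean", mean),
]

result, log = run_workflow(schedule, [10.0, None, 12.0])
```

Because the schedule is data rather than code, the same activities can be rearranged into a different workflow without being rewritten, and the loop is the single place where monitoring happens.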
JU: What struck me about the Workflow Foundation was the way in which workflows can be very big or very small. As small as the sequence of interactions with a form on a web page, in which case the orchestration engine can be embedded entirely in the code that's behind that web page.
Or it can be a very big thing. But in any case, since it's part of the .NET Framework, it can exist in a variety of places. It can run locally on a laptop, it can run on a server in the cloud. There's an interesting amount of flexibility in terms of how workflows can be deployed. An application could embed Trident, or Trident could be used as a service.
RB: That's right. That's the magic of it. Yes, it could be hosted in an environment that the scientist is already familiar with. Or for a big institution, you could post it up as a service. Anybody could access it from a browser. And that's part of our mantra here. If we provide this to the scientists, we have to make sure it works with the tools they're comfortable using. You should be able to point your Linux box running Firefox at this tool.
But to your other point, we're experimenting here with workflows that are resource-seeking. You could launch one, perhaps even on your cellphone, and that scheduling engine's going to look for systems that have resources for that workflow, tap into them, and give the user on the cellphone the impression it's running locally.
JU: You've mentioned that the workflow style encourages a level of modularity that you might not otherwise get. It also provides a level of monitoring, control, and auditing. The reason that's important goes back to the idea of reproducibility.
A friend of mine is an HPC expert, and one of his pet peeves is that when people look at HPC they tend to focus on how much raw horsepower can be thrown at a problem. His question is: "Who's worrying about reproducibility and correctness?" It's a really important question.
In your environment, as I understand it, one of the things that you get is the ability to capture and replay and analyze what happened in a workflow, and the ability to faithfully reproduce a sequence of steps. You talked about enabling things that scientists couldn't do before. It's not only that they couldn't analyze large quantities of data, but also that they couldn't automate their own methods, and be able to reflect on them in an automated way.
RB: Right. Even if we couldn't run a workflow faster, and even if we weren't processing a lot more data, one of our key features is support for provenance.
JU: Explain what you mean by provenance.
RB: Think about it in terms of art. For a given piece of art, we're able to establish through authorities that it's original, where it came from, and who's had their hands on it through its lifetime. Provenance for a workflow result is the same thing. Minimally we want to be able to establish trust in a result. If you think about how that happens, it often starts by considering who wrote the workflow. So with Trident you can click on a result and interrogate the history of the workflow: who wrote it, who reviewed it, who revised it, when it first entered the system.
We do versioning as well, so you can look at an old result and know that it was created by an old version of the workflow. And then have the ability to run the new version on the old dataset to see if it makes a difference.
We capture execution provenance so you know exactly how your result was created. We capture provenance on the workflows themselves so you know who created them, and who's touched them.
You might be thinking about creating a community, where you click on a workflow and can say: "OK, I trust that post-doc."
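A rough Python sketch of what capturing execution provenance might look like follows. The record fields, the function names, and the sample workflow are assumptions made for illustration; the interview does not describe how Trident stores this information.

```python
import hashlib
import json
from datetime import datetime, timezone


def fingerprint(obj):
    """Stable SHA-256 digest of a JSON-serializable object."""
    payload = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def run_with_provenance(workflow, inputs):
    """Run the workflow's steps and return (result, provenance record)."""
    data = inputs
    for _, activity in workflow["steps"]:
        data = activity(data)
    record = {
        "workflow": workflow["name"],
        "version": workflow["version"],
        "author": workflow["author"],
        "steps": [name for name, _ in workflow["steps"]],
        "input_sha256": fingerprint(inputs),
        "output_sha256": fingerprint(data),
        "ran_at": datetime.now(timezone.utc).isoformat(),
    }
    return data, record


# Hypothetical workflow: scale the readings, then total them.
workflow = {
    "name": "scaled-total",
    "version": "2",
    "author": "a-post-doc",
    "steps": [
        ("scale", lambda xs: [2 * x for x in xs]),
        ("total", sum),
    ],
}

result, record = run_with_provenance(workflow, [1, 2, 3])
rerun_result, rerun_record = run_with_provenance(workflow, [1, 2, 3])
```

Hashing the inputs and outputs lets a later reader check that a rerun of the same workflow version on the same data gives the same result, and the version and author fields cover the "who wrote it" questions.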
JU: I've been reflecting on what Microsoft brings to the world of science, in yours and in other collaborations that I've been talking to MSR folks about. One is clearly the special competence and expertise in data management and processing. Even for computationally-oriented scientists, that data expertise isn't necessarily a core competence.
Another is the software tradition of version control. Again, that hasn't been a traditional strength of scientists. So this looks like a fruitful partnership on both fronts.
RB: Agreed. It would be nice to get Catharine van Ingen, or perhaps Alex Szalay to chime in on how this is being used for astronomy. Because we're giving drops of this code to our e-science researchers for use in other areas.
JU: I'd love to talk with Alex. I had a couple of in-depth conversations about the WorldWide Telescope, one with Curtis Wong and the other with Jonathan Fay, and we touched on the work Alex has done. He's using your stuff as well?
RB: Not him personally, but his project -- Pan-STARRS -- is. Catharine van Ingen and Yogesh Simmhan are co-architects of that system along with Alex. And they're bringing workflow to the table. It's becoming the way scientists upload their data into Pan-STARRS and get it back out, and Trident is the workflow engine for that.
You've probably also heard about other activities here in External Research. Perhaps the scholarly communications aspect?
JU: Yep. I've talked to Pablo Fernicola about the Word add-in for authoring scientific papers in the National Library of Medicine XML format. And recently I got the overview of External Research from Tony Hey.
RB: When you think about Trident in the context of scholarly communication -- and to your point about the importance of provenance, we see eye to eye on that -- not only can we use these tools for e-science data management, but we're focusing on reproducible research. When Trident has finished running a workflow, we'll create an XML structure that describes how to call back into Trident to recreate the result. We're really keen on the idea that not only is it easier to do the science, and publish the science, but actually reproduce it. And that XML description should be able to be embedded in the published work.
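To give a feel for an XML description that says how to recreate a result, here is a small Python sketch using the standard library's `xml.etree.ElementTree`. The element and attribute names, the workflow name, and the file path are invented for illustration; the interview does not give Trident's actual schema.

```python
import xml.etree.ElementTree as ET


def describe_run(workflow_name, version, input_uri, parameters):
    """Serialize the details needed to rerun a workflow as an XML string."""
    root = ET.Element("workflowRun", {"workflow": workflow_name, "version": version})
    ET.SubElement(root, "input", {"uri": input_uri})
    for key, value in sorted(parameters.items()):
        ET.SubElement(root, "parameter", {"name": key, "value": str(value)})
    return ET.tostring(root, encoding="unicode")


def read_run(xml_text):
    """Parse the description back into a plain dictionary."""
    root = ET.fromstring(xml_text)
    return {
        "workflow": root.get("workflow"),
        "version": root.get("version"),
        "input": root.find("input").get("uri"),
        "parameters": {
            p.get("name"): p.get("value") for p in root.findall("parameter")
        },
    }


# Hypothetical run: the names and the path are placeholders.
xml_text = describe_run(
    "regrid-and-plot",
    "3",
    "data/salinity-sample.nc",
    {"interpolate": True, "grid_km": 5},
)
run = read_run(xml_text)
```

A description like this is small enough to travel with a paper, either embedded or by reference, while the data and the workflow code stay wherever they are stored.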
That's really exciting. It's been talked about in the computational sciences, but never addressed end to end with a tool that's instrumented, that produces an XML standard the community can own which describes how the science was done, and that gets carried along with the publication, either physically or by reference, and we store this execution script in a database somewhere.
JU: It's a really big idea.
RB: It is, I think it could be transformational.
JU: I do too.
RB: Right now, reproducibility means that if you happen to know the person who did the experiment, or you happen to capture enough stuff in your lab notebook or on your whiteboard, then you have a chance of being able to do it again. But imagine being able to click any result, and automatically and transparently reproduce that result.
JU: In reality it won't necessarily be the case that you can punch a button and have everything replayed exactly. But having the documentation, at that level of detail, and in that form, would be an incredible asset.
RB: Agreed. The hope is that here in External Research, because we're building these tools not just in the context of one science project, but many, you can have community tools that bridge communities. We're talking to people in the earth sciences doing atmospheric studies, and their workflows and analyses are so similar to what the oceanographers are doing. But right now, since those two communities aren't talking or sharing tools, it's very difficult for one community to interact with the other.
JU: That's a really nice point. Well, thanks Roger!
RB: See you later.