In this podcast, MSR researcher Catharine van Ingen and Berkeley micrometeorologist Dennis Baldocchi talk with Jon Udell about their collaboration on www.fluxdata.org, a SharePoint portal to a scientific data server. The server contains carbon-dioxide flux data gathered from a worldwide network of sensors, and provides SQL Server data cubes that help scientists collaboratively make sense of the data.
Dennis Baldocchi is a professor of biometeorology at Berkeley. His research focuses on the physical, biological, and chemical processes that control trace gas and energy exchange between vegetation and the atmosphere. He also studies the micrometeorology of plant canopies.
Catharine van Ingen, a partner architect with Microsoft Research in San Francisco, does e-science research exploring how database technologies can help change collaborative research in the earth sciences. She collaborates with carbon climate researchers and hydrologists.
JU: Dennis, you're someone who's pulling together a worldwide network of CO2 monitoring stations. Can you briefly explain how these devices work?
DB: Sure. Let me give you a bit of history. Back in the late 1950s, David Keeling made some of the first measurements of carbon dioxide concentration -- on Mauna Loa in Hawaii, in the Arctic, very remote locations. They saw an increase in the CO2 concentration in winter, and a decrease in summer. The increase is due to respiration in the biosphere, the decrease is due to photosynthesis. And on top of this they saw a trend due to fossil fuel combustion and logging of tropical forests.
These measurements were just CO2 concentrations. As atmospheric scientists, we know that changes in the atmospheric concentration are due to fluxes. We measure actual fluxes: moles of carbon dioxide, per meter squared, per second, between the atmosphere and the biosphere.
We do it with a combination of sensors. One is a three-dimensional sonic anemometer, which measures up-and-down and lateral-and-longitudinal motions of the air, ten times a second. And simultaneously with new sensors we measure instantaneous change in CO2 concentration.
JU: So it's a combination of sensing wind speed and sensing atmospheric gas.
DB: Absolutely. We measure a covariance between the two, and theoretically that's related to the flux density.
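The covariance DB describes can be sketched in a few lines of Python. This is a minimal illustration, not the network's processing code: the function name `eddy_covariance_flux`, the argument names, and the choice of units are assumptions made for the example. It takes the high-frequency vertical wind samples and CO2 density samples from one averaging period and returns the mean product of their fluctuations.

```python
from statistics import fmean

def eddy_covariance_flux(w, c):
    """Return the covariance of w and c over one averaging period.

    w: vertical wind speed samples in m/s (hypothetical input)
    c: CO2 molar density samples in mol/m^3, same length as w
    The result has units of mol per square meter per second.
    """
    if len(w) != len(c) or not w:
        raise ValueError("w and c must be non-empty and the same length")
    w_mean = fmean(w)
    c_mean = fmean(c)
    # Fluctuations are departures from the period mean.
    products = [(wi - w_mean) * (ci - c_mean) for wi, ci in zip(w, c)]
    return fmean(products)
```

With vertical wind in m/s and CO2 density in mol/m^3, the result comes out in the units DB mentions: moles per meter squared per second. Operational processing applies further corrections that are not shown here.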
JU: And this population of sensors has been growing for 15 or more years?
DB: Yeah, my old lab in Oak Ridge, Tennessee made some of the first sensors we were using in the early 90s. Around then a company called Licor started making a sensor that's about 15 centimeters long and shoots an infrared beam from source to detector. The air can blow through this sensor, and it's low power, doesn't need pumps, so it can be deployed in the middle of nowhere. Many of us run with solar power, so we have a PC that pulls an amp, then the sensor pulls another amp, so for two amps we can run a flux system.
JU: As Catharine points out, there's a long tradition of large-scale collaboration in some scientific disciplines, but it's relatively new in other areas, and it sounds like this is one of those.
DB: Yeah. I was a grad student in the 80s and I remember my professor having a desk full of data. People would knock on the door wanting to borrow it, and there was always some reluctance. It was really a single-investigator culture at the time.
In many ways I credit our Italian colleagues, they were really gregarious and good at hosting wonderful workshops that started bringing people together.
JU: So Catharine, how did Microsoft get involved in building out the scientific data server that supports this project?
CvI: It was serendipity. We had met folks at the Berkeley Water Center two ways. First through Jim Gray's interest in e-science and database applications. Second, one of the current heads of the Berkeley Water Center is an old friend of mine from grad school, Jim Hunt. We were talking about doing a hydrology project, then somehow my colleague at BWC on the computing side, Deb Agarwal, ran into Dennis, and we started talking.
Dennis fit all of the criteria for how I like to engage with scientists. He was desperate, he had a problem that he didn't know how to solve, and that was important, because it meant he was willing to talk to us and teach us things.
Also he had enough data to make things interesting for us. It's not petabytes, but we're talking about the hundred-gigabyte range, and the dataset is extremely diverse. I find it fascinating from an informatics point of view because it's a true scientific mashup to do the data analysis. You're taking the flux data that Dennis just described, as well as a lot of site properties, and other things from the literature, and trying to bring it all together.
JU: There's a whole range of what you folks call ancillary data, which describes soil and vegetation and other aspects of the environment.
DB: To give you an example, the meteorological data, from a database point of view, is fairly simple and regular. Our loggers give us half-hour data, so you get what's essentially an Excel spreadsheet. The rows are timestamped for each half-hour, and the columns are temperature, flux of water, solar energy, and so on. But it gets complex when you weave in the ancillary data. For example, you need to know the population of leaves that control these fluxes. You may measure that in a half-dozen spots, a half-dozen times per year. Then you need to understand leaf photosynthesis, and that's another set of measurements, and then soil texture, carbon, and water absorption. All these measurements are at different depths and different times, so it gets really complex.
CvI: Another interesting aspect, from our side, is handling time. We all think time is linear...
DB: [laughs] Not according to Einstein...
CvI: [laughs] ... well ... so, since we're dealing with plant information, plants photosynthesize during the day. So rather than using wall-clock time, using the plants to tell us about day or night was really fascinating. In effect we're deriving a time window based on the time series data themselves, and for informatics folks, this was more fun than a barrel of monkeys. We've generalized the concept now, and applied it to a couple of other disciplines. Handling time has turned out to be one of the biggest areas of learning.
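One way to picture a time window derived from the data rather than from the clock is sketched below. The transcript does not say which signal or rule the project actually uses, so everything here is an assumption: the function name `derive_windows`, the idea of thresholding a single measured column, and the threshold value are all hypothetical.

```python
from itertools import groupby

def derive_windows(signal, threshold):
    """Group consecutive records into 'day' and 'night' runs.

    signal: one value per half-hour record (hypothetical column,
            for example a light or photosynthesis-related measurement)
    threshold: values above this are labeled 'day' (assumed cutoff)
    Returns a list of (label, start_index, end_index), end inclusive.
    """
    labels = ["day" if v > threshold else "night" for v in signal]
    windows = []
    start = 0
    for label, run in groupby(labels):
        length = sum(1 for _ in run)
        windows.append((label, start, start + length - 1))
        start += length
    return windows
```

The point of the sketch is that the window boundaries fall wherever the measurements say they do, so they move with season and latitude without any calendar logic.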
JU: So what is FluxNet, actually, and how does the data get into the scientific server that you've built?
DB: It started at a workshop we held in Italy in 1995. From that, regional networks started blossoming. First off the ground was the EuroFlux network, then AmeriFlux in about 1997, then over time the Asians, the Canadians. NASA funded us for two cycles, and then things dried up as they decided to go to the moon and to Mars. Most recently we've been funded by NSF, which is funding a whole bunch of ecological networks. On the side, there's been funding to Oak Ridge National Lab, through NASA, to maintain the data acquisition and archive system. And then Deb and Catharine joined in to build value-added products through this FluxData project.
Sometimes I think we're like Tom Sawyer, we've got this fence to paint and all these people are helping us paint it.
JU: Or like stone soup.
CvI: It is like stone soup. From an informatics point of view, the way we think about it is that the data starts with tower owners -- and Dennis is a tower owner as well as a project overseer -- and flows to one of the network repositories, or directly to Oak Ridge where the data is stored.
JU: OK, so your site, www.fluxdata.org, is not the repository, it's for analysis...
CvI: Yes. There are data archive centers, funded primarily by NASA, where you can contribute data, and where data is stored. The challenge for the scientist is to get from the raw data to the science, it's a classic last-mile problem. So the data flows from the repositories to the folks in Europe who are doing gap-filling and uniform processing, and it flows back to Oak Ridge for long-term storage, and it flows to us.
We then make it available to researchers to download, and we provide the value-added summary products. So we're not at the front end gathering data, and we're not the archive, we're in the middle, solving that last-mile problem.
JU: Part of that solution is to put the stuff into data cubes. Dennis wrote somewhere that while these have been used in financial analysis for a long time, their application to scientific analysis is new. It might surprise some people to learn that this way of looking at data isn't common in the scientific world.
CvI: It actually isn't. OLAP databases, data cubes, have been around for a long time. I think I first saw one in the early 90s. But that was really commercial data, it was about finding how to make coupons for Oreos and milk. Scientific data is different in a couple of respects. First, it's much more dense. You tend not to always buy Oreos and milk together, but Dennis always reports CO2 flux, temperature, and precipitation together. The other difference is that a lot of the analysis for commercial data is not at the leaf nodes, it's about annual sales. Whereas a lot of science is actually at the leaf nodes, it's about looking at statistical variation in the half-hourly data.
So we end up building different-shaped cubes.
DB: And let me add that we'll present this data with gaps, for several reasons. One is that if there's a thunderstorm, it might cause the instrument to malfunction. Another is that we have to comply with meteorological steady conditions -- for example, steady winds. So we apply a lot of quality assurance to the data set, and that produces gaps, but any user of the data wants a continuous record. So we need to find ways to fill those gaps.
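To make the gap problem concrete, here is a deliberately simple stand-in: linear interpolation across short runs of missing half-hours. It is not the gap-filling method the network uses, which the transcript does not describe. The function name `fill_short_gaps` and the `max_gap` parameter are assumptions for illustration.

```python
def fill_short_gaps(values, max_gap=2):
    """Linearly interpolate runs of None that are at most max_gap long.

    Gaps at either edge, or longer than max_gap, are left as None.
    """
    filled = list(values)
    n = len(filled)
    i = 0
    while i < n:
        if filled[i] is not None:
            i += 1
            continue
        start = i                      # first missing index of this gap
        while i < n and filled[i] is None:
            i += 1
        end = i                        # first valid index after the gap, or n
        gap_len = end - start
        if start == 0 or end == n or gap_len > max_gap:
            continue                   # no neighbor on one side, or gap too long
        left, right = filled[start - 1], filled[end]
        step = (right - left) / (gap_len + 1)
        for k in range(gap_len):
            filled[start + k] = left + step * (k + 1)
    return filled
```

Long gaps are left untouched on purpose: a straight line across several hours would erase exactly the daily cycle the measurements are meant to capture.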
We also want to partition the fluxes, so we can understand mechanisms. We measure the net ecosystem exchange, but there's a component due to photosynthesis and a component due to respiration. By separating out day and night data we can derive these components, so there's all this value added to the data from the archive.
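The day/night separation can be sketched as follows. This is a simplified illustration under stated assumptions, not the partitioning method used for the archive: it assumes a sign convention where negative net exchange means uptake, treats nighttime net exchange as respiration only, and holds respiration constant through the day. The names `partition_nee`, `nee`, and `is_day` are hypothetical.

```python
from statistics import fmean

def partition_nee(nee, is_day):
    """Split net ecosystem exchange into respiration and photosynthesis.

    nee: net exchange per record, negative meaning uptake (assumed convention)
    is_day: matching list of booleans, True for daytime records
    Returns (reco, gpp) where reco is the mean nighttime flux and
    gpp is the per-record photosynthetic uptake (zero at night).
    """
    night = [f for f, d in zip(nee, is_day) if not d]
    if not night:
        raise ValueError("need at least one nighttime record")
    # At night there is no photosynthesis, so net exchange is respiration.
    reco = fmean(night)
    # By day, net exchange = respiration - photosynthesis.
    gpp = [reco - f if d else 0.0 for f, d in zip(nee, is_day)]
    return reco, gpp
```

Holding respiration constant is the crudest possible choice. Because respiration varies with conditions such as temperature, a serious implementation would model that dependence rather than use a single mean.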
JU: So I looked at some of your pivot tables, for example on sites by vegetation -- how are those being used?
DB: To do cross-site analysis. For example, we're interested in how length of growing season may affect net carbon exchange. When I did this analysis before I met Catharine, I had to open a bunch of spreadsheets and cut and paste, cut and paste. With the cubes, you press a button and the data's there. It really allows you to do a lot of quick what-if questions, and be creative. It makes our work quicker and easier.
CvI: We're also doing a fair amount of sorting. You can sort along vegetation types, to see the difference between croplands and grasslands. We also know each of the sites that is a boreal forest, so you can look at just those, or just tropical forests. If the database has 900 site-years, you can select just the 200 that you need for a piece of analysis.
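The kind of slicing CvI describes, picking the site-years that match a vegetation type and summarizing a measure across them, can be sketched without a cube at all. The example below is only an illustration of the idea: the record layout, the field names, the function `mean_by`, and all the numbers are made up, and the real system uses SQL Server data cubes rather than Python dictionaries.

```python
from collections import defaultdict
from statistics import fmean

def mean_by(records, dimension, measure, **filters):
    """Average `measure` grouped by `dimension`, after applying filters.

    records: list of dicts, one per site-year (hypothetical layout)
    filters: field=value pairs that a record must match to be included
    """
    groups = defaultdict(list)
    for r in records:
        if all(r[k] == v for k, v in filters.items()):
            groups[r[dimension]].append(r[measure])
    return {key: fmean(vals) for key, vals in groups.items()}

# Made-up site-years, for illustration only.
records = [
    {"site": "A", "year": 2003, "vegetation": "cropland", "nee": -120.0},
    {"site": "A", "year": 2004, "vegetation": "cropland", "nee": -80.0},
    {"site": "B", "year": 2003, "vegetation": "grassland", "nee": -40.0},
    {"site": "C", "year": 2003, "vegetation": "boreal forest", "nee": -60.0},
]

by_vegetation = mean_by(records, "vegetation", "nee")
cropland_by_year = mean_by(records, "year", "nee", vegetation="cropland")
```

Here `by_vegetation` compares croplands, grasslands, and boreal forests, while `cropland_by_year` restricts the same rollup to one vegetation type.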
JU: Is it fair to say that until this was brought together it wasn't possible to do this?
CvI: It was possible, but just really tedious.
DB: Back when the network was small, we did a workshop in 2000, and we had about 100 site-years of data from 30 sites. It was easy to be clunky. But now we have 900 site-years from 400 sites, and you just can't use the old methods. We have to go modern.
JU: What kinds of collaboration effects are you seeing? You've written that it's a big challenge to motivate scientists to contribute the ancillary data in a standard way. Getting the stuff in front of people like this, in a common presentation with explanations about what all the variables mean, and how to report them, should help get everybody onto the same page.
CvI: I see a couple of things. First, we're starting to hear from individual tower owners asking us questions, and telling us what's wrong. "I'm sorry, my site isn't really at that lat/lon." Or: "My leaf index is really this."
They see their data being used in papers: we're hosting access for about 60 paper-writing teams. As the papers come to fruition, we're actually tracking what sites they're using, so it's possible to go in and find out who's using your data.
DB: It's motivating. I know my post-doc is so excited when she finds out people are using this data.
JU: That explains why you have an update feature on the site?
CvI: Absolutely. We know there are corrections that need to be made. Treating it as a living, breathing data set, and being able to respond in an organized way to changes...
DB: As more eyes look at it, they can help us fix it. Especially our own data. You look at it and don't see the problem, but when someone tries to use it...oops. In fact we found a problem with our solar heat flux recently. We were doing the correct calculations from 2000 to 2003, then we changed algorithms, and the staff changed, and all of a sudden there was a glitch in how the data were being processed. Finally some scientist from UCLA wanted to use the data, and he plotted it up, and found the problem. So now we're correcting that.
CvI: One of the things that happens when you plot data over time is that you can see any errors in time reporting. One site was off by a couple of months. The data looked fine when you plotted just that site. But if you plot it by nearby sites, suddenly you see the problem. That's the kind of processing -- bringing the data into focus -- that we're engaged in right now.
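A timing error of the kind CvI mentions can be found automatically by sliding one site's series against a neighbor's and looking for the shift with the best match. The sketch below is one possible approach under assumptions, not the project's actual check: the function names `pearson` and `best_lag`, the use of correlation as the match score, and the idea of comparing a single pair of sites are all hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is flat)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)

def best_lag(site, neighbor, max_lag):
    """Return the lag (in samples) at which `site` best matches `neighbor`.

    A positive result means features in `site` appear that many samples
    later than the same features in `neighbor`.
    """
    best = (None, -2.0)
    for lag in range(-max_lag, max_lag + 1):
        # Pair site[i + lag] with neighbor[i] wherever both exist.
        pairs = [(site[i + lag], neighbor[i])
                 for i in range(len(neighbor))
                 if 0 <= i + lag < len(site)]
        if len(pairs) < 3:
            continue
        r = pearson([p[0] for p in pairs], [p[1] for p in pairs])
        if r > best[1]:
            best = (lag, r)
    return best[0]
```

With monthly values, a result of 2 would correspond to a record stamped two months late relative to its neighbor. A result of 0 suggests the timestamps agree.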
JU: So you've got the data online, and tools for viewing and updating the data, but there's also a conversational infrastructure. You have a blog, there are places for people to add comments and have discussions, and all of that is kept together with the data. Catharine, you've said that the role of data curation in science is emerging, and will be key as we increasingly see these mega collaborations with hundreds or even thousands of people working on the same data. You need an environment in which those conversations can be centralized in the same way the data is centralized.
DB: There's also almost a traffic-cop role too, just to avoid redundant efforts. There are several obvious ideas, and multiple groups may want to pursue them. In the long run it's a waste of effort if people are doing the same redundant analysis, and only one paper may get published. If we can get these people to talk to each other, and interact, that's critical.
JU: As Catharine puts it, investing the same effort in publishing data as you would in writing a paper is something that's not yet socialized.
CvI: No, it's not. We see again and again how difficult it is to put the data in a box and tie a bow around it, so people can reuse it. It's very hard, but very important, long-term, for a lot of these environmental problems.
DB: So Catharine, by marking these data sets and giving them some kind of provenance, is this a way scientists can get credit for the work?
CvI: Well, the challenge isn't only enabling that, but also teaching the funding agencies that it's just as important.
JU: Exactly. I've talked to Timo Hannay about this -- he's the guy who runs the web stuff for Nature Publishing -- and this is a huge interest of his. Science is an enterprise that runs on people getting credit for publishing papers, not data. I gather that often papers are published as a thin gloss on a data set, just to get the data out there. There hasn't been a model for publishing the data itself. The fact that the data from somebody's individual tower can be traced back, and then traced through its use in follow-on papers -- that's huge. Your post-doc can not only get excited about other people using her data, she can get credit for their citations of it.
JU: So, the climate effect of CO2 is obviously a hot topic. What have we actually learned at this point?
DB: One paper used this network in combination with remote sensing to see how carbon exchange across Europe responded to the drought and heat wave in 2003. So here was this network poised to measure how the whole biosphere responded to this climate assault.
The network has also been successful with what we call emergent scale processes. One that came out strongly is that plant canopies respond to light more efficiently if the light is diffuse, as opposed to when there are clear skies. That's a process we haven't seen before.
Another thing we found, because we have continuous records, is that if there's a summer rain event, microbes turn on immediately and produce huge amounts of respiration that we never envisioned before. Scientists in the past would miss these extreme events, but by having continuous measurements we can see how the system responds.
JU: But you wouldn't argue for long-term trends in the 15 or so years of data you've collected?
DB: If there are long-term trends, they seem more related to ecosystem dynamics. Many of the forests under study were disturbed at the turn of the century, so they're going through that natural cycle of growth, maturity, and decay. Those ecological features lie on top of any potential climate trends.
JU: So it's more about having an infrastructure in place that allows us to have the data in hand, and then make some predictions?
DB: Yes. Now in fact, one of the things we are seeing is a change in the length of the growing season. As things have gotten warmer, the spring comes earlier, and it's really affecting carbon uptake in the deciduous forests. But the big unknown is that if you have an earlier spring you might also get a summer drought, so you have an increase in carbon in the spring, and a decrease in summer, and the two factors may cancel out. But with our measurements we can see the mechanisms, we can understand and parse out what's happening and why. Whereas in the past, scientists would cut down trees and get tree rings and take one integrated snapshot for the whole year. But they wouldn't understand why, because those tree rings were also affected by drought and temperature and ozone and elevated CO2 and other issues.
CvI: It's really a great time to be doing this stuff, because you're at the juxtaposition of social need, scientific need, and the availability of cheap technology.
DB: And our NSF grant encourages us to do outreach, so this is a great opportunity to do that.
CvI: Jim Gray always used to point out that the post-docs are the ones in any collaboration who most embrace new technology, and move the entire collaboration forward. Knowing the guys over in Europe that's certainly true, and you can see it happening with your own post-docs, Dennis.
JU: So how are these cubes getting built, Catharine? What was the collaboration between you and the scientists?
CvI: We're lucky to be starting with a data set that is very well processed. As to building the rest, Dennis gave us, gosh, I looked at 300 of his graphs. I also got a similar collection from two of his other colleagues. I went through all the graphs and papers to try to understand how the data is manipulated and displayed.
DB: That's a good idea. I didn't realize you did that.
CvI: Oh yeah. [laughs]
DB: That would be helpful, because you see the kinds of products we're trying to create from these databases.
CvI: Absolutely. I started by classifying the graphs into time-series graphs, scatterplots, and then everything else. Then I waded through how everything was sorted, searched, filtered, trying to figure out how to organize the data to enable that class of graphs.
DB: So Catharine, there are a bunch of graphs I'd like to replot with this new database.
CvI: Well Dennis, you and I should have lunch and we should figure out how to rip out a bunch of graphs.
So, along the way we realized that scientists will often make 50 graphs, throw away 48, and keep two. The ability to make a lot of graphs rapidly and simply usually requires some kind of scripting, and that's where you start leaving Excel and going into MATLAB or another scientific analysis tool.
DB: Yeah, I'm using MATLAB a lot nowadays, and I'm seeing things I never saw before. I like having the script files because it gives me some history of what I was looking at.
CvI: That's why we decided to connect MATLAB to the cube, so you can browse the reports we make in Excel, or go directly through MATLAB. Again, it's solving that last-mile gap to the scientist's house.
JU: Well this has been great, thanks!
DB: Yeah, thanks. Catharine, we should get together and talk about some graphs.
CvI: Thanks Jon. And thanks Dennis. Are you in your office? I'll call you later this afternoon.