Cluster computing for the classroom
- Posted: Mar 27, 2008 at 4:35AM
Kyril Faenov is the General Manager of the Windows HPC product unit. Before founding the HPC team in 2004, Kyril worked on a broad set of projects across Microsoft, including running the planning process for Windows Server 2008, co-founding a distributed systems project in the office of the CTO, and developing scale-out technology in Windows 2000. Kyril joined Microsoft in 1998 through the acquisition of Valence Research, an Internet server clustering startup he co-founded and grew to profitability by securing MSN, Microsoft.com and some of the world's other largest web sites as its clients.
Rich Ciapala is a program manager in Microsoft HPC++ Labs, an incubation team within the Windows HPC Server product unit. Rich joined Microsoft in 1992 and has held a number of different positions in technical sales, Microsoft Consulting Services, the Windows Customer Advisory team and the Visual Studio product team.
JU: What Rich just demoed, which we'll show in a screencast, is how a financial model can be deployed to a server that acts as a front-end to a compute cluster. It's a nice easy way for students to use a model developed by a professor, select a basket of securities, run a very intensive computation on them against large chunks of data, and get answers back in an Excel spreadsheet. The bottom line is that the students can run an experiment using a level of computing power that was never before so easily accessible.
KF: Yeah, because of the complexity involved in deploying systems like that, acquiring the data, and curating it, a lot of universities don't have this kind of infrastructure in place. So for a number of students who haven't done this before, this will make it available for the first time. For others who have, it will make it quite a bit easier.
JU: Now these are not computer science students who are learning about high performance computing, and about writing programs for parallel machines, these are students who are learning about financial modeling, and this just makes a tool available to them that can accelerate that.
KF: Precisely. Most of our HPC customers are scientists, or engineers, or business analysts, not computer scientists. They're folks who use mathematics, statistics, differential equations ... sometimes not even math directly, but applications that encode these mathematical models to do research, or engineering, or risk modeling, or decision making. To them it's just a tool, and they want to use it in the way they use PCs today, as transparently and straightforwardly as possible.
JU: What's the situation today for most people? In the case of the covariance model Rich showed in the demo, if it weren't being done like that, how would it be done?
KF: You can do it in Excel, or MATLAB, or SAS, on the workstation. So you'd acquire the data, and use your preferred tool ...
JU: ... and wait a long time ...
KF: ... and wait a long time. And if you want to do a significant amount of data -- like a year's worth, for a large number of stocks -- it might not even be possible at all.
Or you might load it up into a server, but then you have to figure out how to write an application, how to deploy it out to the server, then figure out how to submit the data to the model, pull it back, integrate into the visual analytic process.
This multi-step process is exactly what our HPC customers are running into. They're expressing the models and doing the design on the workstation, using any number of tools. They do the analysis of the results, and visualization, on the workstation. But large-scale computation runs somewhere else. It might be in their organization, it might be out on the Internet, but it's a very disjointed process.
JU: There are clusters out there in academia, and there are people doing these kinds of things, but the point is that hasn't been woven together yet.
KF: That's right. In 2004 the U.S. government published an assessment of U.S. competitiveness in high performance computing. The first recommendation was, and I'm quoting:
Make high performance computing easier to use. Emphasis should be placed on time to solution, the major metric of value. A common software environment that spans desktop to high-end systems will enhance productivity gains.
That's what we're starting to see in the HPC community. Not just getting the systems running as fast as possible, but figuring out how the workflow, the creative element of the scientific process, can be optimized.
JU: So, Rich and I talked about how the particular model used in his demo belongs to a class called parameter sweep, which he distinguishes from the more distributed and chatty kinds of applications. In this case, you can send a batch of data down to a node, it can think about it for a while then give back an answer, and there doesn't need to be much communication. Is that the optimal scenario for this architecture?
KF: Actually, it's optimized for a broad range of HPC applications. In fact, the major goal of the first release of the product, Compute Cluster 2003, was MPI-style [message passing interface] applications. There are a lot of these in engineering and in the environmental space. You're modeling some kind of physical process, and you build a mesh or grid that takes a large physical process or body, partitions it, does computations on local areas, but then has to frequently exchange data across the partitions. Think about a car crash simulation. You might partition the hood of the car into a lot of pieces, every one computed separately, but as the deformation is happening the forces need to be exchanged. Or weather modeling, where heat exchange happens across partitions.
JU: There's a high degree of data interdependence.
KF: Exactly. When you have an interdependent problem, you use MPI for that. We worked with the team at Argonne National Labs that releases the open source reference implementation of MPI, and we've adopted that in our product, optimized the performance and security on Windows, and integrated it into the stack.
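The partition-and-exchange pattern described above can be sketched in a few lines of Python. This is only an illustration, not MPI and not anything from the product: a one-dimensional "rod" of heat values is cut into strips, and on every time step each strip first learns its neighbors' edge values and then updates on its own. The function names (`step`, `run_whole`, `run_partitioned`), the diffusion constant, and the insulated-ends boundary rule are all assumptions made for the example. On a real cluster the exchange phase would be a message between nodes.

```python
def step(cells, left, right, alpha=0.25):
    """One explicit diffusion step on a strip, given its two neighbor values."""
    padded = [left] + cells + [right]
    return [
        padded[i] + alpha * (padded[i - 1] - 2 * padded[i] + padded[i + 1])
        for i in range(1, len(padded) - 1)
    ]


def run_whole(cells, steps):
    """Reference run: evolve the whole rod in one piece, with insulated ends."""
    for _ in range(steps):
        cells = step(cells, cells[0], cells[-1])
    return cells


def run_partitioned(cells, steps, parts):
    """Same rod split into `parts` strips that swap edge values every step."""
    size = len(cells) // parts
    strips = [cells[i * size:(i + 1) * size] for i in range(parts - 1)]
    strips.append(cells[(parts - 1) * size:])
    for _ in range(steps):
        # Exchange phase: each strip reads the edge cell of each neighbor.
        halos = []
        for p, strip in enumerate(strips):
            left = strips[p - 1][-1] if p > 0 else strip[0]
            right = strips[p + 1][0] if p < parts - 1 else strip[-1]
            halos.append((left, right))
        # Compute phase: every strip updates independently of the others.
        strips = [step(s, l, r) for s, (l, r) in zip(strips, halos)]
    return [x for s in strips for x in s]
```

Running both functions on the same starting rod should give the same temperatures, which is the point: the partitioned version only works because the strips talk to each other on every step.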
JU: Right, I knew about the MPI layer in the cluster product. But it seems that the system we're looking at here, for professors to enable students to experiment with financial modeling -- that one is targeting the other class of application.
KF: Right. There is a large class of what we call embarrassingly parallel problems. A lot of statistical analysis falls into that category, and media rendering, where you have a lot of independent tasks. And that's what we have here, because every pair of instruments that needs to be compared is an independent task. What you need to do is spray those tasks across a cluster. We have a solution that makes that much more approachable.
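To make "every pair of instruments is an independent task" concrete, here is a minimal Python sketch. It is not the system shown in the demo. The ticker names and return series a caller passes in are made up, the function names are hypothetical, and a local thread pool stands in for the cluster purely to show the shape of the work (in CPython it would not actually speed up this arithmetic).

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def covariance(xs, ys):
    """Sample covariance of two equally long return series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def pair_task(task):
    """One independent work item: a single pair of instruments."""
    (name_a, series_a), (name_b, series_b) = task
    return (name_a, name_b), covariance(series_a, series_b)


def sweep(returns, workers=4):
    """Spray every pair across a pool of workers and collect the answers."""
    # n instruments produce n * (n - 1) / 2 pairs, none of which depends on another.
    tasks = list(combinations(sorted(returns.items()), 2))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(pair_task, tasks))
```

Calling `sweep` with a dictionary that maps each instrument name to its list of returns gives back a dictionary keyed by name pairs. Because no task needs anything from another, the pool could be replaced by any number of cluster nodes without changing `pair_task`.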
JU: So in this case, that entails mapping the input parameters to a set of work items.
JU: OK. And outside the financial domain, where else will this style be popular?
KF: We'll see this in a range of disciplines. This particular example uses data from an external source -- in this case, the stock market -- and it's looking for patterns of correlations between different signals. This paradigm is broadly applicable. If you think about, for example, clinical research, where you have data coming in from hundreds of patients, where the data includes many parameters about their health condition, and you're looking for disease markers or drug reactions -- you're doing correlation analysis among the different signals.
Or you might have data coming in from sensors deployed in oil and gas pipelines for safety monitoring, or environmental sensors, everywhere you have instruments producing high volumes of data, where you need to find patterns in data, and optimize the scientific process of developing models that produce insight into the data.
JU: Would you say that these embarrassingly parallel problems are low-hanging fruit?
KF: Very much so. And there's another class, Monte Carlo simulation, a method used very effectively across a range of industries to statistically explore different scenarios, for risk analysis and predictive modeling. It's used in financial services, like insurance, but things like process management in factories can also use it, or logistics chains.
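As a rough illustration of the Monte Carlo idea, the Python sketch below estimates the chance that a holding loses more than a given amount over a number of days by simulating many random scenarios and counting the bad ones. Everything here is an assumption chosen for the example: the function names, the normally distributed daily returns, and the parameters a caller supplies. The toy case happens to have a closed-form answer, which is included only as a sanity check.

```python
import random
from statistics import NormalDist


def simulate_loss_probability(trials, days, mu, sigma, threshold, seed=0):
    """Estimate P(total return over `days` < threshold) by random sampling."""
    rng = random.Random(seed)
    losses = 0
    for _ in range(trials):
        # One scenario: add up `days` independent daily returns.
        total = sum(rng.gauss(mu, sigma) for _ in range(days))
        if total < threshold:
            losses += 1
    return losses / trials


def exact_loss_probability(days, mu, sigma, threshold):
    """Closed form for this toy case, used only to check the estimate."""
    return NormalDist(days * mu, sigma * days ** 0.5).cdf(threshold)
```

Each trial is independent of every other, so the trials can be split across nodes in the same way as the pairwise tasks discussed earlier. Real risk models rarely have a closed form, which is why the simulation is used in the first place.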
JU: So for the current example, give us a sense of what skill set is required of the professor in order to create the model and make it available to students. There's some .NET programming involved, right?
KF: Rich, do you want to take this?
RC: Well, you pick your .NET language of choice, and your development environment, which may be Visual Studio. We're making the data available in terms of LINQ, so you need some understanding of that, although for the queries typical of these applications it's fairly basic. And in fact, since it's integrated into the language and you get things like syntax completion, it's probably easier than writing SQL.
JU: There's a framework provided, what does that include?
RC: It does two things. First, it forces you to define the interface for your model in such a way that you can easily build, for example, an Excel front-end to send input and retrieve output. Second, it shows you exactly where you need to do the splitting of the tasks into work items, where you do the spraying of work items to the cluster, and where you put the code that does the covariance and correlation calculations.
KF: The professor focuses on writing the analytics parts, and doesn't have to worry about the fairly complex workflow skeleton that submits the data to the cluster, partitions the work, accesses the results, and then performs the final reduction.
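The division of labor described here, where a framework owns the workflow and the model author fills in only the analytics, might look something like the following Python sketch. The actual framework is .NET-based and its interfaces are not shown in this conversation, so the class names (`SweepModel`, `CorrelationModel`), the method names, and the `submit` hook are all hypothetical.

```python
from abc import ABC, abstractmethod
from math import sqrt


class SweepModel(ABC):
    """Hypothetical skeleton: the framework owns the workflow, the author owns compute()."""

    def split(self, inputs):
        """Default partitioning: one work item per unordered pair of names."""
        names = sorted(inputs)
        return [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]

    @abstractmethod
    def compute(self, inputs, item):
        """The analytics for a single work item; the only part an author must write."""

    def reduce(self, items, results):
        """Default reduction: gather the per-item answers into one table."""
        return dict(zip(items, results))

    def run(self, inputs, submit=map):
        """Split, hand the items to `submit` (a stand-in for a scheduler), then reduce."""
        items = self.split(inputs)
        results = list(submit(lambda item: self.compute(inputs, item), items))
        return self.reduce(items, results)


class CorrelationModel(SweepModel):
    """What a model author would write: just the per-pair calculation."""

    def compute(self, inputs, item):
        a, b = item
        xs, ys = inputs[a], inputs[b]
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sxx = sum((x - mx) ** 2 for x in xs)
        syy = sum((y - my) ** 2 for y in ys)
        return sxy / sqrt(sxx * syy)
```

With this shape, `CorrelationModel().run(data)` runs everything locally through the built-in `map`, and swapping in a different `submit` callable is the only change needed to send the work somewhere else.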
JU: So I can focus on creating the pivot table, or using MATLAB, which is where I'd rather be spending my time.
KF: Yes, in a domain you're expert in.
JU: So, who are the guinea pigs for this system?
RC: Our first two are the University of Washington, which did the model we demonstrated, and the University of North Carolina in Charlotte.
JU: Kyril, I know you have big ideas about where this can go. Why don't you paint the picture?
KF: When we started the HPC team at Microsoft, we realized it's an actively evolving space. But Microsoft is fairly new to it. Without the benefit of 20 or 30 years of experience, we felt we needed to do something that would help us develop expertise and build up an understanding of not just the technology, but also the usage patterns. So we worked with, and funded, 10 universities worldwide, and that's been very helpful.
We've also created an internal team whose mission is to do incubation. The goal of this team is threefold. First, to prototype and demonstrate the end-to-end solutions that our HPC customers will find beneficial, and what Rich has demonstrated is an example of that.
Second, to help us explore the trend that we see as HPC becomes more and more data-driven. There's still the world where you run simulations, of car crashes or weather. But a lot of new applications are mining data for insight, and doing it in a computationally intensive way. That changes the formula for how HPC is used. In many cases it's becoming impractical to put clusters in customer locations, if you have to ship terabytes or petabytes of data around. Data repositories are starting to act like black holes, if you will, that are pulling computation towards them.
JU: I'm sure that's true in the climate area...
KF: Climate, biology, astronomy, geosciences, everywhere that you start accumulating tremendous data sets. We think there's going to be a way that Microsoft can help customers optimize how these services are built, because there's no established architecture today.
JU: Jim Gray was always talking about how it's becoming necessary to Fedex hard disks around the world because there's no other way to move the data to the computation. But instead you're proposing to move the computation to the data.
KF: That's right. We want to incubate a few of these high-value data-centric services, and demonstrate the best practices for doing that while providing free access to academic institutions. That'll help us understand what's involved in operating these services, and potentially we might imagine Microsoft running a few of them.
Then the third goal for the incubation team is to flow the requirements for doing these things into software, so that customers can do this as easily as possible themselves. One of the challenges today is that there's a dichotomy between these very large-scale Internet services being built -- by Microsoft, Yahoo, Google, and others -- but they're in their own world. Customers can't take a slice of that infrastructure and deploy it in their environments.
At the same time, we keep on building off-the-shelf software that people install on their infrastructure, and we're just now learning what it takes to run HPC services using that software. So we want to make sure there's a tight coupling between the team that builds the prototypes and runs the services, and the team that implements off-the-shelf software, such that we run our services using the products that we build. And at the same time, we want to make it a turnkey operation for customers to stand up these services themselves.
JU: That's a key point, so let's underscore it. We're seeing the emergence of a small set of what I call intergalactic clusters, which are one-of-a-kind things, and they are not replicable. They do interesting and powerful things, but you can only do things with them on their terms.
Your notion is that you want to maintain parity, and ensure that you can always replicate what's happening in the cloud if you need to.
KF: Exactly. For example, we just talked about the gravitational pull of data. Imagine you have an astronomy site that accumulates a petabyte. You can try to put it on one of these intergalactic clusters, but that's maybe not what you want. Maybe the most optimal thing is for you to stand up a 1000-node cluster with each node having a terabyte of disk. We want to enable that. We want to be able to tell our customers: Here's how we run these large-scale data-driven HPC applications, and here's how, within a day or two, you can stand up one of these yourself.
JU: So you even see some potential consumer applications for this, don't you?
KF: Sure. Think about search. We can only find answers to questions that have already been answered. But imagine if your questions require novel insight to data. For example, Microsoft HealthVault is starting to accumulate a lot of health data.
JU: Right, so what are my cancer survival prospects given the specifics of my case, and in light of a large body of data about other people?
KF: Or help me do a predictive analysis on my risk of flood or hurricane damage, not for the region in general, but for my house, given the weather and geographical information that's available, and maybe given a few sensors that report data specifically for my house.
To enable these applications, you have to create a platform that makes it possible to curate data, and develop applications that run on top of it. What you see in the service we just demonstrated is a first example of that.
JU: OK, thanks guys.