ARCast.net - More Service Oriented Infrastructure with Mark Baciak (Part 2)
- Posted: Apr 09, 2007 at 8:26 AM
So you have a web service. Now what? Who is using it? What happens if you need to shut it down, version it, or update it? How will you manage the dependencies that others have on your service, and how will you know if you have enough capacity in place to handle the load tomorrow? Creating a web service is easy. Creating a mission-critical, enterprise-wide service is not trivial. On this episode Mark Baciak and I finish the second half of our discussion on service-oriented infrastructure.
-Ron
This is the interesting thing about putting this stuff out on the Internet. I started looking at where people are listening, and they're coming from all over the world: from Argentina, from the Caribbean, from Russia, from Europe, from Australia, India, China. Everywhere, people are listening to ARCast. It's so cool! It's part of the global nature of this program.
We are once again going to bring in our intrepid field reporter, Mohammad Akif, who is one of our Microsoft guys up in Canada, who's at the Toronto Forum for Financial Services Architects. Mohammad, what do you have for us today?
So, Robbie, would you like to tell us something more about what is happening, and why do you think this is important for other architects to think about?
Why should architects know about this? I think architects are the key link between business and IT, and it's very important that they be in the loop all the time and define solutions to help have innovative solutions for business.
This is Mohammad Akif, reporting from the Financial Services Architect Forum in Toronto.
Thank you, Robbie.
You know, if you think technology is cool, I'm here to tell you that business is cool. It keeps the life blood of our IT projects flowing: and that's money. We need money, we need the businesspeople, and we need them behind us.
Speaking of that, I'm going to return now to the second half of my chat with Mark Baciak, who's one of my colleagues here in Redmond, about service-oriented infrastructure and the capabilities you need.
We started the first half of this last time, and we're going to pick up right in the middle, as Mark and I are talking about service-oriented infrastructure and capabilities. So, let's welcome Mark Baciak.
[applause]
'Who would you need to call if you had to shut it off tomorrow?' 'I don't know. I guess I would shut it off and my phone would ring. I would find out, right?'
You need to have a way of kind of connecting the human element. It's not just: add a web reference and go party. I might want to know, for example, that you're planning on using my service. And you're going to use the heck out of it. You're going to send me 5,000 transactions per second, so I'd better buy some more servers or something. So you guys built that into your system?
For example, I've got a zip code lookup service: Any application in our enterprise can use it, I don't care. But other services have very sensitive data in them. They might have sales information, or they might have your salary, for example. Do you want any application just to start using that and have access to it? Probably not.
In the enterprise, you have to have control gaps. Someone can add a web reference, but that doesn't mean they're allowed to call, even if they have the right credential.
So we actually vet it down to the application itself, to say, 'What's the business justification for this application calling in to this service?' Then the provider can not only evaluate the business justification, but also the SLAs that they expect: the response times, the time out, the exceptions, and the whole nine yards.
The first part of the equation was the subscription process--to know who they are. The second one comes from the analytical side, because when you're a provider, you have certain rights: 'I paid for this service,' which is typically how it happens, 'I already have an application that's using it, and I'm going to be giving this away for free to the enterprise, so I can decide who gets to use it, and what their SLAs are.'
What we typically found out, once we did that, is that as consumer load increased, the performance on their application suffered. So what do they do? They're like, 'Well, I'm paying for it. It's mine, so I'm going to shut you off.' [laughs]
Not only knowing what the consumers are, which is very important, but if they start using 70 percent of your load, from a call perspective, you can either throttle them back--which we have all the capabilities to do through a wide variety of ways--or you can start to do something which everyone hates: make them pay for it.
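To make the 70 percent idea concrete, here is a minimal sketch of how a provider might spot a consumer that dominates the call volume. The function name, the log format, and the consumer names are all hypothetical; nothing here is taken from the Alchemy runtime itself.

```python
from collections import Counter

def consumers_over_share(call_log, max_share=0.70):
    """Return consumers whose fraction of all calls meets or exceeds max_share.

    call_log is assumed to be an iterable of consumer names, one per call.
    """
    counts = Counter(call_log)
    total = sum(counts.values())
    if total == 0:
        return []
    return sorted(name for name, n in counts.items() if n / total >= max_share)

# Hypothetical log: "billing" makes 8 of 10 calls.
calls = ["billing"] * 8 + ["hr"] + ["crm"]
print(consumers_over_share(calls))  # ['billing']
```

A consumer on that list is a candidate for throttling or for a billing conversation.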
That's actually a big, big obstacle to a lot of organizations. They're like, 'I don't want to just let people use my server. They're going to start sucking all the life out of my app.'
Very quickly, as you start opening this up to the enterprise, you have to think about what the billing structure for this is. What you typically start to do is your service becomes a shared service. It becomes a service that the business relies on for a multitude of business units. Well, guess what: everyone has to share the cost of increasing the infrastructure to support that service.
So if I'm on two servers, and all of a sudden I find that I'm not able to meet my SLAs, or my processors are pegged at over 80 percent, I need to add more hardware. Well, who's going to pay for it? My budget's stuck.
What I want to do is then offset those costs by saying, 'Look, if you want to use this service--and we know it's a critical service--then I have to take from everyone's cost centers, based upon how much they use it, in order to pay for this type of hardware.'
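Charging cost centers "based upon how much they use it" can be sketched as a proportional split. This is only an illustration: the cost figure, the cost-center names, and the call counts are invented, and a real chargeback model would likely be more involved.

```python
def allocate_cost(total_cost, calls_by_cost_center):
    """Split total_cost across cost centers in proportion to their call counts."""
    total_calls = sum(calls_by_cost_center.values())
    if total_calls == 0:
        raise ValueError("no usage recorded; nothing to allocate against")
    return {
        center: total_cost * calls / total_calls
        for center, calls in calls_by_cost_center.items()
    }

# Hypothetical: a 20,000 hardware purchase split across three cost centers.
usage = {"sales": 600_000, "hr": 300_000, "finance": 100_000}
shares = allocate_cost(20_000, usage)
print(shares)  # {'sales': 12000.0, 'hr': 6000.0, 'finance': 2000.0}
```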
Then you start making it clear for people who find it beneficial for them, also, that nothing in life is for free, and they very quickly realize that.
However, the billing element of services and their consumption is further along the curve. Right now, as far as enterprises are concerned, services are for free. That model is going to radically change in the next two years.
Let's talk about that runtime for a minute.
So you're saying, "I'd like to do a call", and it gets ready to go out. What we do is we intercept at the invoke basis and we do a whole series of things inside of our runtime.
First of all, we figure out how you've authenticated so we know who you are, we know what the application is; we know who the consumer is.
We then send this information up to our security token service. This is that service that's in front inside of our repository. We get all the details back. First off, do we even know who you are? Even though you have a token, are you a known person inside of our feed? Are you trying to mimic, are you trying to play something bad, so on and so forth.
Let's say that you're all good. Then we'll send you back your uniform principal identity and roles. We'll also send you back a token. This is a trusted token so when you make a call to another system, it knows that you came from our environment, and you've been validated and authenticated. So that pretty much kicks off part of the process. But we also do a whole bunch of dynamic stuff from there.
With every single service there's a policy that goes for it, and inside that policy we dynamically determine many things. First of all, we go against the policy to make sure can you even do this call? So we're not spamming the line and making the service do the heavy lifting saying no, you can't call here; no, you can't call here. And if they're not able to make that call we cut them right there.
But if you pass through there then we dynamically figure out many things. We dynamically figure out what priority you are. Are you a high priority, are you a low priority, because that affects our routing solution.
We also look for the stage that you're in. If your application has been flagged as in test, we're not going to let you make calls to the operational side. Because you don't want to corrupt the data. That's another stopgap that we put there that helps with our routing solution.
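The checks described so far (is the caller allowed at all, what priority is it, and is a test-stage application trying to reach the operational side) could be sketched roughly like this. The class names, fields, and stage labels are assumptions made for illustration, not the actual policy format.

```python
from dataclasses import dataclass, field

@dataclass
class ServicePolicy:
    allowed_apps: set                                # apps with an approved subscription
    priorities: dict = field(default_factory=dict)   # app name -> "high" or "low"

@dataclass
class Decision:
    allowed: bool
    reason: str = ""
    priority: str = "low"

def evaluate(policy, app, app_stage, target_environment):
    """Reject the call on the consumer side before it ever reaches the service."""
    if app not in policy.allowed_apps:
        return Decision(False, "no approved subscription")
    if app_stage != "production" and target_environment == "production":
        return Decision(False, "non-production application cannot call production")
    return Decision(True, priority=policy.priorities.get(app, "low"))

policy = ServicePolicy({"payroll-ui"}, {"payroll-ui": "high"})
print(evaluate(policy, "payroll-ui", "test", "production").reason)
```

Doing the rejection up front is what keeps the service itself from doing the heavy lifting of saying no.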
Our next one is we look for timings. We look at things like what is your response time. And we look at this in two factors: we look at how responsive it should be, and when we should time out. The timeout is the physical abort.
Why you want to keep one higher than the other is because normally my response time is, say, 70 milliseconds, but my timeout is ten seconds. Well, it's a very low response time, but just in case the service is just overloaded with jobs or maybe in a complex operation, you want to give it enough time to complete on the server side. So technically, you'll bust your service-level agreement, but you're not killing the other side of the equation.
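The two thresholds can be read as two separate outcomes for a slow call. A small sketch, using the 70 millisecond and ten second figures from the example above as defaults (the function and label names are made up):

```python
def classify_call(elapsed_ms, sla_ms=70, timeout_ms=10_000):
    """Classify a call against the expected response time and the hard timeout."""
    if elapsed_ms >= timeout_ms:
        return "aborted"      # hard timeout reached: the call is cut off
    if elapsed_ms > sla_ms:
        return "sla_breach"   # slower than agreed, but allowed to complete
    return "ok"

for elapsed in (50, 500, 10_000):
    print(elapsed, classify_call(elapsed))
```

A 500 millisecond call busts the service-level agreement but still completes; only a call that reaches the timeout is actually aborted.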
So this kind of gives each side a balance, not only on the consumer side, but on the provider side. Once all this is done, we dynamically calculate your transport options. These are things like signing, encryption, and compression. We figure out what we need to do to the packet before it actually hits the wire.
Once all this is done, we figure out which endpoint you're actually going to route to and we insert all these options. At the time, we used Web Services Enhancements (WSE), in many different versions, in order to be compliant from a specification standpoint on the wire.
On the wire it was all WS-* compliant. It goes to the other end, and then it hits the other Alchemy runtime. We unravel it; we do what's called partial trust. So even though it looks like everything's OK, we make another call to the STS, just to make sure you're not doing anything you shouldn't be doing, nothing nefarious. Then we do the other points of evaluation that come into play, we hand that off to the actual code itself, it comes back, and we do some more metrics.
So everything that's happening here, all these different passes, we're actually also metering them so we can tell how long we were in the wire, how long we're actually in the service code itself, how long we're in the wire back, and these are all great points for analysis.
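One way to picture metering each pass separately is a timer wrapped around every phase. This is a sketch only: the class name and phase names are hypothetical, and the `time.sleep` calls stand in for real work on the wire and in the service.

```python
import time
from contextlib import contextmanager

class CallMeter:
    """Records how long each named phase of a call took, in milliseconds."""

    def __init__(self):
        self.phases = {}

    @contextmanager
    def phase(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the duration even if the phase raised an exception.
            self.phases[name] = (time.perf_counter() - start) * 1000.0

meter = CallMeter()
with meter.phase("wire_out"):
    time.sleep(0.001)   # stand-in for the request travelling to the service
with meter.phase("service_code"):
    time.sleep(0.002)   # stand-in for the provider's own code
with meter.phase("wire_back"):
    time.sleep(0.001)   # stand-in for the response travelling back

for name, ms in meter.phases.items():
    print(f"{name}: {ms:.2f} ms")
```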
Then the client comes back and then you log everything that happened from the request as well as the response if you're in the request/response phase.
If you use some competing technologies, just to do that part is anywhere from 100 to 400 milliseconds, and that's just to do that one small part, much less all these duties that we actually defined. So what we did, through the architecture, through the coding of it, we got it down to one millisecond.
So in the configuration or repository or whatever it is, somehow you mark my app as in a test stage, and your service has a test stage service, and you're going to allow me to call that stage?
[laughter]
It's not just a one-to-one; you can't say that "Hey, you can only go to test." So if my stage happens to be dev, test, or uapp and I only have two services up, that might be a test service and my operational service. So dev, test, and uapp all get routed to the testing service, and once I flag it as production operation, then you can route to production.
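The many-to-one mapping from application stage to deployed service can be sketched as two lookups. The stage names follow the example above; the endpoint URLs are placeholders, not real addresses.

```python
# Several application stages share the single test deployment.
STAGE_TO_DEPLOYMENT = {
    "dev": "test",
    "test": "test",
    "uapp": "test",
    "production": "production",
}

# Hypothetical endpoints for the two deployments that are actually running.
ENDPOINTS = {
    "test": "https://test.example.invalid/orders",
    "production": "https://prod.example.invalid/orders",
}

def route(app_stage):
    """Pick the endpoint for a calling application based on its flagged stage."""
    try:
        return ENDPOINTS[STAGE_TO_DEPLOYMENT[app_stage]]
    except KeyError:
        raise ValueError(f"unknown stage: {app_stage!r}") from None

print(route("dev"))
print(route("production"))
```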
Because we love to see how this reacts to our environment. A prime example, we had one of our services out there; very critical service, it had 60 consumers. That's a lot of consumers.
And of course, they'd been in production for several years so they had tons of people who were dependent upon this information, and then they did this major update. And what changed? If you actually did the analysis of it, the data contract didn't change, the message contract didn't change, and the policy didn't change; so as far as we were concerned, there was no change.
Everyone upgraded. They sent out the notification that everyone should test and go in and whatever; but we didn't verify it. We just kind of said, "OK, we just kind of trust you to do this. You're the provider, the onus is on you to figure out if it works or not."
Well, production came, and everything went off, everyone's happy, yeah, everything's running. Well, two weeks later, fire drill. 57 applications upgraded, but three didn't.
The three who didn't were now corrupting the database. Because how they had the message contract defined was pretty tight. The data contract was a dataset of this one call and they were using the ordinals differently; that's the number, the index on there, so one went to seven or something like that. So they were completely putting the wrong data in the wrong columns.
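The ordinal problem is easy to reproduce in miniature. In this sketch the column names and the row are invented; the point is only that reading by position silently returns the wrong field once the column order changes, while reading by name does not.

```python
# Column layout before and after a hypothetical update that inserts "region".
columns_v1 = ["customer_id", "name", "balance"]
columns_v2 = ["customer_id", "region", "name", "balance"]

row_v2 = (42, "EMEA", "Contoso", 1250.0)

# A consumer written against v1 still reads "name" at ordinal 1.
name_by_ordinal = row_v2[columns_v1.index("name")]

# A consumer that looks fields up by column name survives the reordering.
record = dict(zip(columns_v2, row_v2))
name_by_column = record["name"]

print(name_by_ordinal)  # EMEA (the region, not the name)
print(name_by_column)   # Contoso
```

Nothing fails syntactically in the first case, which is why the wrong data can land in the wrong columns without anyone noticing right away.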
What do they do? The first thing: They try to figure it out themselves. They had this entire system that they're built on top of, and they decide not to leverage it. So, they go back into their database, they're looking at things and going nuts.
Finally they realize, hey, let's call the Alchemy guys. Within five minutes of calling us, we told them the three applications, we told them every user who's done it, and we told them every call and when it was made.
Why? Because it was throwing exceptions. It was like, "Oh yeah, we've got your exceptions right here, we know the applications that are doing it." The worst part about it, they were being notified by this. [laughter]
Luckily, we have a data warehouse, but unfortunately our data warehouse has to be archived every half day, because we produce over ten gigabytes of data every half day. So, it was pretty extensive, we actually had to roll back.
Since we knew the date range, we had to go back to the tapes, replay the tapes, get off the information, and then pull out all the transactions, which wasn't too hard to do.
Then we actually showed them the requests. We showed them the requests before, so we showed them the data we got back, the data that they changed, and the exception that came back. So, they were very quickly able to solve the problem.
Now when I click breaking change, we can actually go through the analytics to make sure everyone hit the testing environment, and if they were throwing exceptions or not.
If we see too many exceptions, they also basically haven't gone through the validation process. So, if they meet the threshold for whatever the count is, established by the provider, and the provider also goes through and does the visual verification and then ticks them off, then they're ready to go.
So, what would happen in the future: if 57 of those applications had been approved, then once that thing hits production, the three that weren't approved get turned off. They can no longer make calls to that service.
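The gating rule described here (the consumer hit the test environment, stayed under the provider's exception threshold, and was ticked off by the provider) might look something like the following. The data shapes, names, and numbers are assumptions for illustration.

```python
def is_validated(stats, max_exceptions, signed_off):
    """A consumer passes if it tested, stayed under the threshold, and was ticked off."""
    return stats["calls"] > 0 and stats["exceptions"] <= max_exceptions and signed_off

def consumers_to_turn_off(test_stats, max_exceptions, sign_offs):
    """Return consumers that should lose access when the breaking change goes live."""
    return sorted(
        app
        for app, stats in test_stats.items()
        if not is_validated(stats, max_exceptions, app in sign_offs)
    )

# Hypothetical analytics from the testing environment.
test_stats = {
    "orders-web": {"calls": 120, "exceptions": 0},
    "legacy-batch": {"calls": 0, "exceptions": 0},    # never hit the test service
    "reporting": {"calls": 40, "exceptions": 25},     # too many exceptions
}
sign_offs = {"orders-web", "reporting"}

print(consumers_to_turn_off(test_stats, max_exceptions=5, sign_offs=sign_offs))
# ['legacy-batch', 'reporting']
```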
Like I said, there's no way we could have planned that the data contract and message contract and the policy were not enough.
A data contract and the message contract, if it changes that's a syntactic change in the messages. But if you have something that's an opaque data structure like a data set, at the syntax level it just says, "Oh, it could be anything." It's like an XSD "any" tag, right?
So, it can't be validated syntactically, which means you have to rely on semantic validation like, "The ordinal at this one should be this, and this ordinal should be that." Semantic validation is so easy to mess up. It allows the human element to totally screw it up, which it did in this case.
We talk about dependencies in the system, so we actually have this thing called Service Modeling Language, SML, which is a rename of what we used to call System Definition Model, which is how we actually model systems at Microsoft.
We are actually making it so that you can model that environment. We try to put more transparency into what the dependencies are of that service. And so if something does a major change, we're actually able to flag it so that we know there is another breaking change and where that breaking change occurs.
It's those discrepancies in data that unfortunately only a human can make semantically meaningful. We're not to the point where cyborg computers are able to rationalize.
If my code is throwing exceptions about something that's going on, I'm not going to update the database. I'm going to make this exception so painful that somebody's going to have to notice it and you're not going to ignore it for two weeks.
So, there's always a missing link. With respect to, how do I put that in my code, how do I become more efficient? We're getting better with our tooling; we're making it so that people can focus on the business problem. A prime example: in .NET 3.0 and even 2.0, we have this new thing called end-to-end tracing. Are you familiar with that?
Well, instead of me having to go through and put all these quality gaps in each of these different ones, it just bubbles up. I'm actually able to capture it, see what's going on, and provide that as root cause analysis. Very powerful. Newer technology, but...
The runtime component itself, it would be kind of a mix. What we'd like to do is couple it with System Center. They've got a whole new series of things that are coming out, specifically for web servers' management.
I've actually talked to Dejanje, basically their web services guru over there. We're talking about, if we do a reference architecture for SOA inside of architecture strategy, actually incorporating System Center to show how they can do that using a product, so they don't have to write their own.
I know that these are interesting things because they're the kind of things you have to think about once you get beyond the simple web service. You know, like "Hey, here's my web service. Add a web reference, start calling it, woo-hoo. We're done. That's service-oriented architecture."
Well, that's just the beginning, that's baby steps. If we're going to do this on an enterprise-wide scale, you're really going to need more and so much more.
In fact, I've got another very cool episode of ARCast, just recorded the other day, which is going to have a lot more to say about this. But that's coming in the future.
I hope you'll keep listening and send me a note at ARCast@microsoft.com.
We'll see you next time on ARCast.