- More Service Oriented Infrastructure with Mark Baciak (Part 2)

    Announcer: It's Monday, April 9, 2007, and you're listening to ARCast.
    Ron Jacobs: That's right, and what a week it has been here at ARCast studios. We've been watching the statistics on the website, and just incredible numbers. I don't know what happened last week; it's like everybody and their brother went out and downloaded all the episodes of ARCast or something. It just went berserk.

    This is the interesting thing about putting this stuff out on the Internet. I started looking at where people are listening, and they're coming from all over the world: from Argentina, from the Caribbean, from Russia, from Europe, from Australia, India, China. Everywhere, people are listening to ARCast. It's so cool! It's part of the global nature of this program.

    We are once again going to bring in our intrepid field reporter, Mohammed Akif, who is one of our Microsoft guys up in Canada, who's at the Toronto Forum for Financial Services Architects. Mohammed, what do you have for us today?
    Mohammed Akif: I have with me here Robbie J. Ron, from the Bank of Nova Scotia. He's presented an excellent view of what he's doing at Scotiabank, in terms of aligning the business and IT initiatives and architecture.

    So, Robbie, would you like to tell us something more about what is happening, and why do you think this is important for other architects to think about?
    Robbie J. Ron: Prior to what we've been doing recently, it was mostly going to business, being very reactive to what business wanted, and then delivering to their request. What we've done now is, we have a process in place where we are very, very proactive. We plan, and we work together with business, in terms of getting the IT plan defined, and go back to them, and they're happier now.

    Why should architects know about this? I think architects are the key link between business and IT, and it's very important that they be in the loop all the time and help define innovative solutions for the business.
    Mohammed: Excellent. So how has this project actually helped you increase the profile of IT? Is it something that has improved the relationship with the business or helped you prioritize? What are some of the benefits that you've gotten out of putting a framework like this together?
    Robbie: All of the above. I think business loves it. The feedback we're getting from them is that they've always wanted a tool like this, and a framework where they could actually plan together with IT; this is exactly what they wanted. The relationship manager framework that we've built is also very useful. We cater to the business head and kind of help define what he wants.
    Mohammed: Excellent, Robbie. Thank you very much. Once again, this was about the framework that Bank of Nova Scotia has built and is using, with respect to aligning their IT projects and keeping track of how they align to the business strategies, and updating that on a regular basis so that it doesn't become an obsolete document.

    This is Mohammed Akif, reporting from the Financial Services Architect Forum in Toronto.

    Thank you, Robbie.
    Robbie: Thank you.
    Ron: Thank you, Mohammed.

    You know, if you think technology is cool, I'm here to tell you that business is cool. It keeps the lifeblood of our IT projects flowing, and that's money. We need money, we need the businesspeople, and we need them behind us.

    Speaking of that, I'm going to return now to the second half of my chat with Mark Baciak, who's one of my colleagues here in Redmond, about service-oriented infrastructure and the capabilities you need.

    We started the first half of this last time, and we're going to pick up right in the middle, as Mark and I are talking about service-oriented infrastructure and capabilities. So, let's welcome Mark Baciak.

    Ron: One of the things you guys did that I thought was a great policy, and it's a great idea that people ought to follow but I don't see many doing, is the notion of arranging the service agreement between human beings.
    Mark Baciak: [laughs]
    Ron: A lot of people are like, 'Hey, I put out a service, and people are using it.' And you go, 'Oh, great. Well, who's using your service?' 'I don't know. I just get all these calls coming in...'

    'Who would you need to call if you had to shut it off tomorrow?' 'I don't know. I guess I would shut it off and my phone would ring. I would find out, right?'

    You need to have a way of kind of connecting the human element. It's not just: add a web reference and go party. I might want to know, for example, that you're planning on using my service. And you're going to use the heck out of it. You're going to send me 5,000 transactions per second, so I'd better buy some more servers or something. So you guys built that into your system?
    Mark: Yeah. There are actually two parts to that question: The first one is about subscriptions. Subscriptions are just a way of a consumer telling the provider, 'I'd like to use your service.' Most of the time, with how services are created, the provider just wants to know.

    For example, I've got a zip code lookup service: Any application in our enterprise can use it, I don't care. But other services have very sensitive data in them. They might have sales information, or they might have your salary, for example. Do you want any application just to start using that and have access to it? Probably not.

    In the enterprise, you have to have control gates. Someone can add a web reference, but that doesn't mean they're allowed to call, even if they have the right credentials.

    So we actually vet it down to the application itself, to say, 'What's the business justification for this application calling in to this service?' Then the provider can not only evaluate the business justification, but also the SLAs that they expect: the response times, the timeouts, the exceptions, and the whole nine yards.
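
    Mark's vetting step could be sketched in miniature like this. This is a hypothetical illustration, not the actual Alchemy API; every class and field name here is invented:

```python
from dataclasses import dataclass

# Hypothetical sketch of the subscription gate: a consumer requests access
# to a service with a business justification and expected SLA terms, and
# calls are allowed only after the provider approves the subscription.

@dataclass
class SlaTerms:
    response_ms: int        # expected typical response time
    timeout_ms: int         # hard timeout, kept well above response_ms
    max_calls_per_sec: int

@dataclass
class Subscription:
    consumer_app: str
    service: str
    justification: str      # what the provider reviews before approving
    sla: SlaTerms
    approved: bool = False

class SubscriptionRegistry:
    def __init__(self):
        self._subs = {}

    def request(self, sub: Subscription) -> None:
        self._subs[(sub.consumer_app, sub.service)] = sub

    def approve(self, consumer_app: str, service: str) -> None:
        self._subs[(consumer_app, service)].approved = True

    def may_call(self, consumer_app: str, service: str) -> bool:
        # Adding a web reference isn't enough: the call is allowed only if
        # an approved subscription exists for this app/service pair.
        sub = self._subs.get((consumer_app, service))
        return sub is not None and sub.approved
```
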
    Ron: You would also need to identify the human element, like, 'Who do I call if I see that this app is sending me garbage?' or that sort of thing.
    Mark: Yeah. That's on our whole service management side. Going back to the second part of the equation.

    The first part of the equation was the subscription process--to know who they are. The second one comes from the analytical side, because when you're a provider, you have certain rights: 'I paid for this service,' which is typically how it happens, 'I already have an application that's using it, and I'm going to be giving this away for free to the enterprise, so I can decide who gets to use it, and what their SLAs are.'

    What we typically found out, once we did that, is that as consumer load increased, the performance on their application suffered. So what do they do? They're like, 'Well, I'm paying for it. It's mine, so I'm going to shut you off.' [laughs]
    Ron: Yeah. Right. [laughs]
    Mark: They didn't realize that there are responsibilities that come into play, too: making sure that they're meeting their SLAs, that their uptime is what they said it was going to be, and so on and so forth. We also have to think about how we monetize this.

    Not only knowing who the consumers are, which is very important, but if they start using 70 percent of your load, from a call perspective, you can either throttle them back--which we have all the capabilities to do in a wide variety of ways--or you can start to do something which everyone hates: make them pay for it.
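
    The "70 percent of your load" check might look something like this in miniature (the names and the threshold here are illustrative, not from the real system):

```python
from collections import Counter

# Hypothetical sketch: count calls per consumer and flag anyone whose share
# of total traffic crosses a threshold, so the provider can throttle them
# back or start charging them for the load.

class LoadMonitor:
    def __init__(self, share_threshold: float = 0.7):
        self.share_threshold = share_threshold
        self.calls = Counter()

    def record_call(self, consumer: str) -> None:
        self.calls[consumer] += 1

    def over_share(self, consumer: str) -> bool:
        # True once this consumer exceeds its allowed share of all traffic.
        total = sum(self.calls.values())
        return total > 0 and self.calls[consumer] / total > self.share_threshold
```
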
    Ron: Yeah. [laughs] You're going to send me that many transactions; you're going to pay for it.

    That's actually a big, big obstacle to a lot of organizations. They're like, 'I don't want to just let people use my server. They're going to start sucking all the life out of my app.'
    Mark: Who pays for it? At the end of the day, we have a certain amount of budget, and in your SLA, you control your consumers, so you knew how many people were going to use it--not only from the application perspective, but from the users of that application. You knew the community, and you did all your tests for that.

    Very quickly, as you start opening this up to the enterprise, you have to think about what the billing structure for this is. What typically happens is your service becomes a shared service. It becomes a service that the business relies on for a multitude of business units. Well, guess what: everyone has to share the cost of increasing the infrastructure to support that service.

    So if I'm on two servers, and all of a sudden I find that I'm not able to meet my SLAs, or my processors are pegged at over 80 percent, I need to add more hardware. Well, who's going to pay for it? My budget's stuck.

    What I want to do is then offset those costs by saying, 'Look, if you want to use this service--and we know it's a critical service--then I have to take from everyone's cost centers, based upon how much they use it, in order to pay for this type of hardware.'

    Then you start making it clear for people who find it beneficial for them, also, that nothing in life is for free, and they very quickly realize that.

    However, the billing element of services and their consumption is further along the curve. Right now, as far as enterprises are concerned, services are for free. That model is going to radically change in the next two years.
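
    The cost-sharing Mark describes, splitting the price of new hardware across cost centers in proportion to usage, reduces to a few lines. A hypothetical sketch:

```python
# Hypothetical chargeback sketch: each cost center pays for the shared
# service's new hardware in proportion to its share of the call volume.

def allocate_costs(calls_by_center: dict, hardware_cost: float) -> dict:
    total = sum(calls_by_center.values())
    return {center: round(hardware_cost * calls / total, 2)
            for center, calls in calls_by_center.items()}
```
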
    Ron: This was very common on the mainframes, that you had to pay for your processing time on the mainframes, and then we all went to PC servers, and we were like, 'Hey, it's all free again!' And now we're going to go back. [laughs]
    Mark: [laughs] Oh yeah. Everything's cyclical. Welcome to IT.
    Ron: One of the interesting ways you kind of implement this subscription idea is by building this Alchemy runtime that is also operating on the consumer side of the service, whether it's on a client PC, or another server, or whatever. If I'm consuming a service, there's a runtime that does a number of things when I'm going to make a call to a service.

    Let's talk about that runtime for a minute.
    Mark: Yeah. You can think of the runtime, first off, as an interceptor. It exists at either end, so it can exist either at the consumer end or at the provider end. So when you do a call, which you actually create inside your application as a service call, you either do Invoke or BeginInvoke, right?

    So you're saying, "I'd like to do a call", and it gets ready to go out. What we do is we intercept at the invoke basis and we do a whole series of things inside of our runtime.

    First of all, we figure out how you've authenticated so we know who you are, we know what the application is; we know who the consumer is.

    We then send this information up to our security token service. This is that service that sits in front of our repository. We get all the details back. First off, do we even know who you are? Even though you have a token, are you a known person inside of our feed? Are you trying to mimic someone, are you trying to do something bad, and so on and so forth?

    Let's say that you're all good. Then we'll send you back your principal identity and roles. We'll also send you back a token. This is a trusted token, so when you make a call to another system, it knows that you came from our environment, and that you've been validated and authenticated. So that pretty much kicks off part of the process. But we also do a whole bunch of dynamic stuff from there.
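
    In rough outline, that security hop might look like the sketch below. Everything here is a stand-in: the user directory, the token format, and the function name are invented for illustration, not taken from Alchemy:

```python
# Stub "security token service": check the caller against a known-user
# directory, and hand back a principal with roles plus a trusted token
# that downstream systems will accept.

KNOWN_USERS = {"alice": ["Trader", "Reader"]}   # stand-in for the STS store

def sts_validate(user: str) -> dict:
    if user not in KNOWN_USERS:
        # Unknown or spoofed caller: cut them off before any service code runs.
        raise PermissionError(f"unknown caller: {user}")
    return {
        "principal": user,
        "roles": KNOWN_USERS[user],
        "token": f"trusted:{user}",   # proof the call was validated here
    }
```
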

    With every single service there's a policy that goes with it, and inside that policy we dynamically determine many things. First of all, we go against the policy to make sure you can even make this call, so we're not spamming the line and making the service do the heavy lifting of saying, 'No, you can't call here; no, you can't call here.' If they're not able to make that call, we cut them off right there.

    But if you pass through there then we dynamically figure out many things. We dynamically figure out what priority you are. Are you a high priority, are you a low priority, because that affects our routing solution.

    We also look for the stage that you're in. If your application has been flagged as in test, we're not going to let you make calls to the operational side. Because you don't want to corrupt the data. That's another stopgap that we put there that helps with our routing solution.
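
    That stopgap boils down to a single rule: a non-production consumer never reaches an operational endpoint. A minimal sketch (the stage names are illustrative):

```python
# Hypothetical stage gate: consumers flagged as anything other than
# "production" are never routed to an operational endpoint, so test
# traffic can't corrupt live data.

def may_route(consumer_stage: str, target_kind: str) -> bool:
    if target_kind == "operational":
        return consumer_stage == "production"
    return True   # test endpoints accept traffic from any stage
```
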

    Our next one is we look for timings. We look at things like what your response time is. And we look at this in two factors: we look at how responsive it should be, and when we should time out. That's the physical abort.

    Why you want to keep one higher than the other is because normally my response time is, say, 70 milliseconds, but my timeout is ten seconds. Well, it's a very low response time, but just in case the service is just overloaded with jobs or maybe in a complex operation, you want to give it enough time to complete on the server side. So technically, you'll bust your service-level agreement, but you're not killing the other side of the equation.
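
    The two timing knobs, a soft SLA target and a hard abort, can be expressed like this (the 70 ms and 10 s figures are just the examples from the conversation):

```python
# Sketch of the two timers: exceeding the response-time target only flags
# an SLA breach, while exceeding the timeout actually aborts the call.

def classify_call(elapsed_ms: float, response_ms: int = 70,
                  timeout_ms: int = 10_000) -> str:
    if elapsed_ms > timeout_ms:
        return "aborted"        # the physical abort
    if elapsed_ms > response_ms:
        return "sla-breached"   # busted the SLA, but the call completed
    return "ok"
```
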

    So this kind of gives each side a balance, not only on the consumer side, but on the provider side. Once all this is done, we dynamically calculate your transport options. These are things like signing, encryption, and compression. We figure out what we need to do to the packet before it actually hits the wire.

    Once all this is done, we figure out which endpoint you're actually going to route to, we insert all these options, and then we used, at the time, Web Services Enhancements, in many different versions, in order to be compliant from a specification standpoint on the wire.

    On the wire it was all WS-* compliant. It goes to the other end, and then it hits the other Alchemy runtime. We unravel it; we do what's called partial trust. So even though it looks like everything's OK, we make another call to the STS, just to make sure you're not doing anything you shouldn't be doing, nothing nefarious. Then we do the other points of evaluation to make sure the other parts that come into play are in order, we hand that off to the actual code itself, it comes back, and we do some more metrics.

    So everything that's happening here, all these different passes, we're actually also metering, so we can tell how long we were on the wire, how long we're actually in the service code itself, how long we're on the wire back, and these are all great points for analysis.

    Then the client comes back and then you log everything that happened from the request as well as the response if you're in the request/response phase.
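
    The per-leg metering could be sketched as timers wrapped around each pass; the function names here are invented for illustration:

```python
import time

# Time each leg of the call separately: serialization onto the wire, the
# service code itself, and deserialization on the way back.

def timed(fn, arg):
    start = time.perf_counter()
    result = fn(arg)
    return result, (time.perf_counter() - start) * 1000.0   # elapsed ms

def invoke_with_metrics(serialize, service, deserialize, payload):
    metrics = {}
    wire_out, metrics["wire_out_ms"] = timed(serialize, payload)
    raw, metrics["service_ms"] = timed(service, wire_out)
    result, metrics["wire_back_ms"] = timed(deserialize, raw)
    return result, metrics
```
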
    Ron: So a lot of people are going to go, "Wow, that sounds like it's really slow; sounds like it takes a lot". Does this add a lot to the call time?
    Mark: This is the painful part. The runtime is 80,000 lines of code, and I wrote them all myself, so I know it very well. I rewrote some segments as many as twelve times to squeeze out as much time as we could. Because originally, we were looking at just doing authentication and authorization.

    If you use some competing technologies, just to do that part is anywhere from 100 to 400 milliseconds, and that's just to do that one small part, much less all these duties that we actually defined. So what we did, through the architecture, through the coding of it, we got it down to one millisecond.
    Ron: Wow! That's amazing, wow.
    Mark: It was a lot of work.
    Ron: There's a very interesting side effect of building this runtime, and this interception layer gives you a lot of options, like you said. Let's say I want to subscribe to your service and you go "OK, I might let you do it, but first you have to pass some testing, because I'm not going to let your app just call my production system until I know you're doing it right".

    So in the configuration or repository or whatever it is, somehow you mark my app as in a test stage, and your service has a test stage service, and you're going to allow me to call that stage?
    Mark: Yes.
    Ron: And then when you're happy you go into the configuration panel and you say, "OK, Ron's app is now approved", and with no code changes, the dynamic routing just says, "Oh, you're now approved. You can call the production one."
    Mark: Everything that we've done was built on .NET technologies, which gave us a lot of benefits. Not only for the time it takes to produce them, but we have this neat thing in .NET called the FQAN - you should know this one - the fully qualified assembly name.

    Which is basically a PKI part of signing code. Not only do you have your namespace and your version, but you also have that token that goes with it.

    So what we do is: any consumer, first of all, has to have a FQAN. So it has to be signed. That offers us two things: first of all, if you're a signed assembly, that means you're in our portfolio management system, so you can no longer be a shadow application.
    Ron: Yeah, OK.
    Mark: It guarantees us that we know beyond a shadow of a doubt that you are that consumer.
    Ron: OK.
    Mark: So that offers us a lot of flexibility. We know based upon that FQAN what stage you're in, because you tell us.

    It's not just one-to-one; you can't just say, "Hey, you can only go to test." So if my stages happen to be dev, test, and UAT, and I only have two services up, that might be a test service and my operational service. So dev, test, and UAT all get routed to the testing service, and once I flag it as production/operational, then you can route to production.
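
    That stage-grouping rule could be sketched as a pair of lookups; the FQANs and endpoint names below are fabricated stand-ins for the real signed-assembly data:

```python
# Hypothetical sketch: map a consumer's FQAN to its stage, then map stage
# groups onto the two deployed endpoints. Dev, test, and UAT all share the
# testing endpoint; only production-flagged consumers reach operational.

FQAN_STAGE = {
    "Payroll.UI, Version=1.0.0.0, PublicKeyToken=abc123": "uat",
    "Trading.Svc, Version=2.1.0.0, PublicKeyToken=def456": "production",
}

STAGE_ENDPOINT = {
    "dev": "testing-endpoint",
    "test": "testing-endpoint",
    "uat": "testing-endpoint",
    "production": "operational-endpoint",
}

def endpoint_for(fqan: str) -> str:
    return STAGE_ENDPOINT[FQAN_STAGE[fqan]]
```
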
    Ron: OK.
    Mark: So it's completely dynamic on how you choose those equations.
    Ron: I mean, it's a good practice, right? I think a lot of people initially start out just "Yeah, let them call our production service", until they have that first blowup.
    Mark: Oh, no, no, no. [laughter]
    Ron: Then they're like "We should test this thing first".
    Mark: Yes, because they'll help destroy it. You can never guess how the users are going to use it. And we thought we were pretty smart guys. All of a sudden, our users proved us wrong, and we loved it.

    Because we love to see how this reacts in our environment. A prime example: we had one of our services out there, a very critical service; it had 60 consumers. That's a lot of consumers.

    And of course, they'd been in production for several years so they had tons of people who were dependent upon this information, and then they did this major update. And what changed? If you actually did the analysis of it, the data contract didn't change, the message contract didn't change, and the policy didn't change; so as far as we were concerned, there was no change.

    Everyone upgraded. They sent out the notification that everyone should test and go in and whatever; but we didn't verify it. We just kind of said, "OK, we just kind of trust you to do this. You're the provider; the onus is on you to figure out if it works or not."

    Well, production came, and everything went off, everyone's happy, yeah, everything's running. Well, two weeks later, fire drill. 57 applications upgraded, but three didn't.

    The three who didn't were now corrupting the database. Because of how they had the message contract defined, it was pretty tight. The data contract was a dataset on this one call, and they were using the ordinals differently; that's the number, the index on there, so one went to seven or something like that. So they were completely putting the wrong data in the wrong columns.
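
    The ordinal bug in miniature: two consumers read the same row, one by column name and one by position; when the provider reorders the columns, the positional reader silently picks up the wrong field. A small illustration:

```python
# The provider's "non-breaking" upgrade reorders the columns.
old_columns = ["account", "balance"]
new_columns = ["balance", "account"]

row = {"account": "A-17", "balance": 250}

def by_name(row, col):
    return row[col]                 # survives the reorder

def by_ordinal(row, columns, i):
    return row[columns[i]]          # depends on column position

# A consumer still coded against the old layout asks for ordinal 0,
# expecting the account, and silently gets the balance instead.
stale = by_ordinal(row, new_columns, old_columns.index("account"))
```
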
    Ron: This, my friends, is why I don't trust datasets. Especially when you do things like by ordinal. Ooh, that's creepy, oh.
    Mark: At the end of the day, when you define an interface, you've got to pay for it somewhere.
    Ron: Yes, right.
    Mark: Someone's got to know that. And by putting it off in the dataset, while it offers you a lot of flexibility, it doesn't provide transparency. So you pay for it one way or the other, and we paid hard in this case. Remember, now, for two weeks you've got data being corrupted, and no one knows, first of all, who the applications are. Of course, fire drill, they're like, 'Who's doing it? Not only who's doing it, but what are they doing?'

    What do they do? The first thing: they try to figure it out themselves. They had this entire system that they were built on top of, and they decided not to leverage it. So they go back into their database, looking at things and going nuts.

    Finally they realize, hey, let's call the Alchemy guys. Within five minutes of calling us, we told them the three applications, we told them every user who'd done it, and we told them every call and when it was made.
    Ron: That's because you have this log of all the activity on their server.
    Mark: Right. Every call that's made we actually have in our queue as well as in our data warehouse. So, in our queue we can actually tell you the transaction ID that we correlate with it when it actually occurred. Without a shadow of a doubt we know exactly what went on.

    Why? Because it was throwing exceptions. It was like, "Oh yeah, we've got your exceptions right here; we know the applications that are doing it." The worst part about it: they were being notified of this. [laughter]
    Ron: And they were ignoring their notifications.
    Mark: So this is what happens when you also do a staff change. New staff, all these other things. It was just pretty much the worst-case scenario you could imagine for an upgrade. Anyway, so now they're like, "We see the transactions; now we want to roll back the data," because we know who did it, we know when they did it, and we want to know what they did.

    Luckily, we have a data warehouse, but unfortunately our data warehouse has to be archived every half day, because we produce over ten gigabytes of data every half day. So it was pretty extensive; we actually had to roll back.

    Since we knew the date range, we had to go back to the tapes, replay the tapes, get the information off, and then pull out all the transactions, which wasn't too hard to do.

    Then we actually showed them the requests: the request before, the data we got back, the data that they changed, and the exception that came back. So they were very quickly able to solve the problem.
    Ron: Yikes.
    Mark: So what did we learn from that, though? We learned a great thing about upgrades. We learned very quickly that services become highly dependent upon data. By doing this we now have a new check inside of our runtime that says, "Hey, guess what, if I have a breaking change," because this was a breaking change, they switched ordinals, "and it's not reflected anywhere else, I can make sure people comply."

    Now when I click breaking change, we can actually go through the analytics to make sure everyone hit the testing environment, and if they were throwing exceptions or not.

    If we see too many exceptions, then they basically haven't gone through the validation process. So, if they meet the threshold, whatever the count is established by the provider, and the provider also goes through and does the visual verification and then ticks them off, then they're ready to go.

    So, what would happen in the future is that if 57 of those applications had been approved, once that thing hits production, the three that didn't would get turned off. They could no longer make calls to that service.

    Like I said, there's no way we could have planned that the data contract and message contract and the policy were not enough.
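
    The compliance check Mark describes after the incident might be reduced to something like this (a hypothetical sketch, with invented names):

```python
# After a provider flags a breaking change, only consumers that actually
# exercised the test environment, stayed under the exception threshold,
# and got the provider's visual sign-off keep their production access.

def approved_consumers(test_calls: dict, exceptions: dict,
                       signed_off: set, max_exceptions: int = 0) -> set:
    return {app for app, calls in test_calls.items()
            if calls > 0
            and exceptions.get(app, 0) <= max_exceptions
            and app in signed_off}
```
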
    Ron: But, see, here's the thing. I have been saying this for a long time now, because I'm always suspicious of data sets or what I call the loosey-goosey anti-pattern. It means that the data cannot be validated syntactically alone.

    A data contract and a message contract, if they change, that's a syntactic change in the messages. But if you have something that's an opaque data structure, like a dataset, at the syntax level it just says, "Oh, it could be anything." It's like an XSD "any" tag, right?

    So, it can't be validated syntactically, which means you have to rely on semantic validation like, "The ordinal at this one should be this, and this ordinal should be that." Semantic validation is so easy to mess up. It allows the human element to totally screw it up, which it did in this case.
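
    Ron's point can be shown in a few lines: a typed contract fails fast on a syntactic check, while an opaque "could be anything" payload sails through and only breaks semantically, later. A sketch:

```python
# A typed contract rejects a renamed field immediately; an opaque payload
# (like an xsd:any or a dataset) passes every syntactic check.

TYPED_CONTRACT = {"zip": str, "city": str}

def validate_typed(msg: dict) -> bool:
    return (set(msg) == set(TYPED_CONTRACT)
            and all(isinstance(msg[k], t) for k, t in TYPED_CONTRACT.items()))

def validate_opaque(msg) -> bool:
    return True   # syntactically, "it could be anything"
```
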
    Mark: This is why modeling is a good thing that we put in as a capability. Not only can you look at the three primaries, but under versioning now we also do a service model.

    We talk about dependencies in the system, so we actually have this thing called Service Modeling Language, SML, which is a rename of what we used to call System Definition Model, which is how we actually model systems at Microsoft.

    We are actually making it so that you can model that environment. We try to put more transparency into what the dependencies are of that service. And so if something does a major change, we're actually able to flag it so that we know there is another breaking change and where that breaking change occurs.
    Ron: My point is, I don't want to have to rely on a human to say, "This is a breaking change." What I want to do is design my contracts so that if I change them, automated systems can tell there's a change.
    Mark: We can do that for the three, but I wish it was easy, and unfortunately it's not. There's a reason for that, and that's because--remember, we said those fifteen ZIP code systems, right--well, guess what, each one has different data. Maybe I put in postcode, maybe I say ZIP code. Even though the data contract looks exactly the same, the data in it is different.

    It's those discrepancies in data that unfortunately only a human can make semantically meaningful. We're not to the point where cyborg computers are able to rationalize.
    Ron: But I think certain contracts are more vulnerable to those kinds of problems than others.
    Mark: Data sets.
    Ron: Yes. The other thing that makes me wonder about this is, if I have a service that's getting semantically incorrect data, it's throwing exceptions, how am I still managing to corrupt the database and updating data when I have all these things that are going wrong in the process? It seems to me I want to code more defensively than that.

    If my code is throwing exceptions about something that's going on, I'm not going to update the database. I'm going to make this exception so painful that somebody's going to have to notice it and you're not going to ignore it for two weeks.
    Mark: [laughter] Yes, like I said, ultimately the failure happens to be a human failure. The notifications were hitting there and the people were seeing what was going on, but they just weren't addressing it. Either they get complacent, or they get so flooded with exceptions that they just start ignoring them. "Oh yes, I checked the last time; it wasn't that big of a deal."

    So, there's always a missing link. With respect to, how do I put that in my code, how do I become more efficient? We're getting better with our tooling; we're making it so that people can focus on the business problem. A prime example: in .NET 3.0 and even 2.0, we have this new thing called end-to-end tracing. Are you familiar with that?
    Ron: Yes.
    Mark: For the audience, it's this fantastic way of actually going in and doing an inspection into any .NET 2.0 type of code or higher. So that you can actually start to see what's going on on a stack basis; go into all these different modules and say, "Oh, yeah. I've got a service that calls this block of logic right here, which calls this database, and guess what, the trigger's now blown on that database."

    Well, instead of me having to go through and put all these quality gates in each of these different ones, it just bubbles up. I'm actually able to capture it, see what's going on, and provide that as root cause analysis. Very powerful. Newer technology, but...
    Ron: See, this is the interesting thing I think about: if you guys could share code samples of this, to see the implementation of this in WCF, that would be really cool. I think it has a lot of things that are going to make it much easier this time around and so forth.
    Mark: The runtime perspective, the Alchemy, the overarching architecture, has got a lot of parts to it. We've got a repository, we've got our analytics, we've got our runtime, and we've got the infrastructure that we build upon.

    The runtime component itself would be kind of a mix. What we'd like to do is couple it with System Center. They've got a whole new series of things that are coming out, specifically for web services management.

    I've actually talked to Dejanje, basically their web services guru over there. We're talking about, if we do a reference architecture for SOA inside of Architecture Strategy, actually incorporating System Center to show how you can do that using a product, so you don't have to write your own.
    Ron: Beautiful. OK, we are out of time, way out of time, but this was really great, Mark. Thanks so much for joining me today.
    Mark: Thank you.
    Ron: Mark Baciak with some great thoughts on service-oriented infrastructure and the history of Alchemy and reference architecture and whatnot.

    I know that these are interesting things because they're the kind of things you have to think about once you get beyond the simple web service. You know, like "Hey, here's my web service. Add a web reference, start calling it, woo-hoo. We're done. That's service-oriented architecture."

    Well, that's just the beginning; that's baby steps. If we're going to do this on an enterprise-wide scale, you're really going to need more, so much more.

    In fact, I've got another very cool episode of ARCast, just recorded the other day, which is going to have a lot more to say about this. But that's coming in the future.

    I hope you'll keep listening and send me a note to

    We'll see you next time on ARCast.
    Announcer: ARCast radio is a production of the Microsoft Architecture Strategies Team,
