No more so than rotuing all load balanced requests via a single hardware loadbalancer, switch, or other single resource such as backend functional database.
Redundancy is key, just as would be for the backend functional database (StockTraderDB); you would, for highest reliability, want to use Windows Clustering Services (as you might also for the transacted msmq previously discussed)
to ensure you have at least one backup node for the SQL repository database configured/ready to come online should the primary node fail; as well as backup power supply; databases using RAID arrays with a mirrored (RAID 10 for example) format, etc.
Hello again. Here is some more info on your questions, at least with respect to the StockTrader app as I hesitate to position myself as an expert on MSMQ details, and obviously you have lots of good experience here. I think an interesting follow up would
be to get a new tech whitepaper, in conjunction with the MSMQ/enterprise services team, on the different approaches to load balancing and failover with MSMQ, especially wrt to Windows Server 2008 and MSMQ 4.0, which is the product team's current focus--along
with WCF-specific content. I will be investigating this shortly as I know their knowledge is much deeper than mine.
"So, if I were to duplicate this without WCF, there would be code that essentially would select a different queue path depending for each request, thereby kinda load balancing. Fair enough, but how does this resolve the situation where the path you select
points to a machine that is down?"
Yes. The key here is that the fault tolereance is built into the WCF client, in custom code heavily tied to the WCF-based configuration service I built into the app. It is not perfect, and I plan to spend some time on this part in future releases as well.
But the client uses a WCF ChannelFactory to create a channel to the MSMQ endpoint. With the Config Service I did, mutliple endpoints are managed in a SQL respository, and their is notification between service hosts and service clients about new available nodes
as they come online. If a node goes down, the client logic will loop through other online processing nodes until success delivery of the message. It will eventually abort after a settable number of retries, throwing an exception back to the customer client
code that the endpoint is not reachable on any node, if no nodes are actually online.
When a node cannot be reached for a consecutive number of contigous requests, a client automatically removes it from it's local load balanced list. When a service host is restarted (or a new one started), it will re-notify clients of this fact and will be
added back into their load balanced list. The logic is not perfect as a 1.0 sample; again want to spend time here in next update. But once I decided to store config settings in a central SQL repository for each service, I decided to also make this repository
a 'clustering config center' of sorts; keeping track of new service hosts that come online automatically based on the simple notification service I did (also WCF based).
For example, the Web clients will also load balance against multiple Business Service hosts (not MSMQ based; rather TCP or HTTP-based) using same mechanism. It can also load balance against multiple copies of WebSphere running the Business Services as J2EE-based
Web Services; although since Trade 6.1 does not implement the Config Service I did, auto-notification does not occurr. Another thing (besides some cleanup of the failover logic) that needs to go into this sample is some polling logic so if the network goes
down (but service host never stops running) then is reconnected, clients automatically understand this as well and start using the service host node again. This should be fairly straightforward to add.
"Due to the nature of MSMQ, a local message will be queued and will wait in the outgoing queue until the target machine comes back online. This provides resiliency, but not really fault tolerance and failover, because now I have an orphaned message that
may be stuck in the outgoing queue forever if the target machine never "comes back.
With a transacted queue, of course, the message is persisted to disk. Messages will stay there, as you point out, until the processing logic, wherever it lives, comes back online. I believe the general approach for failover is to use Windows Clustering Services
(not Windows Load Balancing) that provided the ability to have many machines, attached to shared RAID arrays, act as active/passive backups (up to 8, I believe more available with Windows Server 2008). For assured message delivery, any node hosting a persisted
queue, including forward queus, would need to be clustered in such a fashion. This clustering ensures your processing machine, attached to the storage where the message is persisted, can go down but another node come online automatically to pick up processing.
The same technology is used for MS EXchange and SQL Server clustering. It is not load-balancing ala Windows Network Load Balancing.
"If so, I still am confused as to how this is any different than before. You still have the problem of ensuring the the polling mechanism is only sending to servers that "up". Since MSMQ naturally abstracts this entire process away from you, you're still
stuck with the "orphaned message" scenario if the target machine never comes back and the message is sitting in the outgoing queue."
In StockTrader, multiple endpoints are automatically managed by the config service, so, yes, the computer running the central queue with the Order Processor set in "FORWARD" mode would be deployed on
an active/passive cluster with Windows Clustering (Enterprise Edition of Windows Server) with up to 8 backup/failover nodes agaoinst the single MSMQ transacted queue. The FORWARD node of the OrderProcessor service host will do a distributed tx read from the
local queue with the second part of the transaction doing a write to one or many load balanced nodes that will actually do the processing of the message against the database. hence, this central machine only does read/writes, no actually processing; and can
be clustered for failover. If the write fails to the targeted (load balanced) PROCESSING node, then the message is attempted to be delivered to another PROCESSING node if one exists (as active in the repository/config system). If it cannot be delivered,
it would remain in the central queue and re-tried until a PROCESSING node comes online.
On the PROCESSING side, the WCF service host uses WCF's Poison Message event to detect a posion message that cannot be processed against the DB, and ships this to a posion message (offline) queue : perhaps for audit/workflow fixup although this part is not
implemented. But it must be removed from the active queue, this is settable in the binding config for the WCF MSMQ binding.
One other thing that is used for perf reasons is WCF's facility to batch transactions, processing multiple messages from the queue as part of a larger atomic transaction, to reduce the number of distributed tx's which are expensive. This must be balanced against
concurrency issues, I am batching 5 messages at a time on the processing side; this is also set via the WCF MSMQ binding config (XML config files for .NET) and could be set in code as well with WCF.
Hope this info helps, again, this is my approach I took, would be interested in your feedback and ideas for better approaches as well.
The load balancing of MSMQ is performed by the WCF client, which will round-robin against servers that are running the OrderProcessor service host. This is built into the StockTrader AsyncOrderClient itself.
With WCF, actually, when using an MSMQ binding, the client app is not directly communicating with the OrderProcessor.exe service host; rather it is talking to MSMQ.
A Windows Service, installed and active with .NET 3.0 after you install .NET 3.0, runs in background and automatically routes messages to the WCF service operation (method) when they are present in the MSMQ queue. Hence, the OrderProcessor service host process
does not need to be active for client to submit, although orders will not be processed unless it is.
With MSMQ 3.5, you are right in that you cannot do a remote read as part of a distributed tx (you can do a remote write as part of a distributed tx). This limitation is removed with MSMQ 4.0, which is part of Vista and Windows Server 2008 (beta still).
There is a fairly well-known work around for the remote read/distributed tx issue with MSMQ 3.5, which calls for creation of a polling mechanism that essentially transfers the message to the local processing computers local MSMQ as part of a distributed tx;
which then reads them locally and processes as part of another distributed tx.
However, the WCF MSMQ binding is nicer; in that you do not need to create any MSMQ polling logic given way WCF background service automatically routes messages to the active service host process. For StockTrader, I included a "FORWARD" mode (set as a config
option in the Order Processor self-host program) that enables the OrderProcessor service host on a central MSMQ machine to 'forward' orders to one or more active instances of the same order processor service host (running in STANDARD mode) on remote server.
Since all reads are then against a local MSMQ, all are part of a distributed tx. The Order Processor service running in FORWARD mode does not need database connectivity--it simply gets the messages and ships them to the order processor instance(s) running
in Standard mode on other machines, which then processes them against the DB.
With Windows Server 2008, this would not be necessary and remote reads can be part of a distributed tx coordinated by the MS DTC.
Also, MSMQ will work with Windows Network Load Balancing; there are some good articles on MSDN about this. however, by default you would get unequal load distributed becuase of nature of TCP persistent connections. Typically I see all load go to one server
for about 1 minute, then all load goto another server for a minute, etc. etc. But I believe the articles on MSDN have some advice on how to reconfigure to fix this but I have not personally tried this. Another good article on windows network load balancing
and WCF services is:
But StockTrader is using a client-driven app level load balacning (simple round robin) built into the app itself, not windows network load balancing by default. Simply start multiple service hosts (Order processor service); point them at same config repository,
and clients get notification and will start load balancing against them.