I am currently on a project where I have to generate a huge number of XML packets. I am using the XmlValidatingReader to validate the XML as text fragments. The problem is speed; let me explain. I am validating up to 2 million fragments in
a single loop (yes, really). The fragments are never much over 2K in size, but since the reader doesn't offer (at least as far as I can tell) a way to clear the current fragment and load another, I have to create a new reader on each pass through the loop. This can't
be the optimal way to do it. Does anyone have any ideas on validating huge amounts of XML in a loop that won't take me several days to finish? Oh, and to make matters worse, the fragments are all different types (12 total types of XML packets). I was really hoping to
do something like this, but it doesn't work:
XmlValidatingReader vr = new XmlValidatingReader(strXMLFragment, XmlNodeType.Document, context);
while (HIDEOUSLY_LONG_LOOP)
{
while (vr.Read()) {}
vr.Close();
strNextXMLFragment = GetNextXMLFragment();
vr.Open(strNextXMLFragment);
}
-
-
Aha! I knew my question would be too difficult for anyone (including you Microsofties) to answer! Why is it that every time I program a real-world, production application I run up against these real boundaries that bite me in the rump? This app is in production and handles over 7 million voters in a particular state. They are demanding that I get this whole process running at a specific rate that is an order of magnitude higher than it is currently running, and that it must be done in .NET. Geez! I tell you, sometimes us guys in the trenches just have to eat big spicy dirt balls all day. It's enough to make a guy want to give up and become a Walmart greeter. Oh, and before anyone suggests threads, the app is already fully threaded. Now, someone, anyone, must have run into this situation where they are validating HUGE amounts of XML fragments and have to do it fast! Please, someone, throw me a bone!
-
What, no one is even going to attempt an answer!? I am surprised, given the number of folks on the C# team that I know read this site! Come on guys/gals, you cannot tell me that no one has thought this through!! Am I to believe that XML Schema validation is something that will take orders of magnitude longer to complete than the surrounding code? For instance, if my loop takes 10 cycles to run, the validation will add 100 or 1000 cycles or more!
-
I would guess the lack of response is directly related to the lack of detail in your question.
What does your architecture for this solution look like? It seems to me that if you are processing up to 2 million XML whatevers, you need to make sure that you are doing so asynchronously in order to keep up with the amount of data you're talking about. Or does some condition exist that does not allow you to do so? How do the XML fragments you speak of get created, by what, and where do they reside once created? What kind of processing response times are expected? What mechanism kicks off the XML fragment processor that you discuss above?
Can you give more details about what the system is doing and then I'm sure people in the forum can provide some guidance.
Thanks. -
That is fair enough. I will simplify the question. The XML is coming down an FTP pipe inside files that each contain 100 XML fragments. I am grabbing each file as it becomes available and processing the entire file. If they come in asynchronously, they get processed that way; everything is running on its own worker threads. I cannot control the feed of the data unless Congress passes legislation.
Now, the processing expected is on the order of 18,000 fragments per hour, and I have identified that within my system the bottleneck occurs when I validate a fragment, period. Take the validation out of the loop and the system climbs to 300,000 records per hour! Put it in, and the performance degrades to 20,000 an hour; there is no other way to put it. I am using the validator code that is all over the web: basically, load up a fragment and loop through it. It doesn't get much simpler; the code is below. The problem is that I have to destroy the reader for every packet because there is no way to reload a packet into the reader.
That was my original question. Why is the XSD validator so gosh darned slow? I have to validate each and every packet I queue up to send, and I have to send them in order to the state. In other words, if I send one up with a transaction number that is lower than one already sent, they ignore it (short-sighted on their part, I know). This precludes me from easily processing outbound packets in, say, a huge thread pool and sending them up, so I have to chew through the list in order and send them up in order. This greatly limits my speed. But, once again, I STRESS, remove the validation from the loop and the performance climbs to over 300,000 records per hour! Put it back in and you crash down to 20,000!
//Create the XmlParserContext.
XmlParserContext context = new XmlParserContext(null, null, null, XmlSpace.None);
XmlValidatingReader vr = new XmlValidatingReader(p_XML, XmlNodeType.Document, context); // Create a validating reader to validate XML file.
vr.ValidationEventHandler += new ValidationEventHandler(XmlValidationError); // Associate an event handler to trap Validation errors.
vr.ValidationType = ValidationType.Schema; // Validate schema type.
vr.Schemas.Add("",mMainForm.mXMLSchema); // Load XML schema file.
while (vr.Read()) {} // Validate the XML file.
vr.Close(); // Close the validating reader
bRet = true;
-
If you are loading the schema each time, then you are probably taking a significant performance hit on that operation.
To start, I'd see if something like this would help (replace the place where you load the schema). You load your 12 schemas into memory only once and then reuse them from memory instead of loading them for every single fragment:
XmlSchemaCollection schemaCache =
new XmlSchemaCollection();
schemaCache.Add("urn:voting-type1-schema", "Vote1.xsd");
schemaCache.Add("urn:voting-type2-schema", "Vote2.xsd");
//Load more of your schemas...
//
vr.Schemas.Add(schemaCache); -
I doubt that would work. I don't know ahead of time what type of XML fragment is in the file; it could be 1 of 12 different fragment types. Also, the provided XSD is a monolithic one, covering all 12 fragment types. So, the way I see it, the only way is to load up the XSD each and every time I load up a fragment. Which, as I see it, is the problem: why can I not just slide a new fragment into the existing reader and schema? Also, the state doesn't send the XML in the traditional sense. Each file contains 100 fragments, and each fragment is preceded by its own XML declaration, so I have to break the file into fragments when I start. A sample of the file is below, followed by my pseudocode loop.
<?xml version="1.0" encoding="UTF-8"?>
<sos_ack><tran_id>10304061500002955</tran_id><tran_status>N</tran_status><status_message>2402 Can not add district (03-AAO) because it already exists</status_message></sos_ack>
<?xml version="1.0" encoding="UTF-8"?>
<sos_ack><tran_id>10304061500002956</tran_id><tran_status>N</tran_status><status_message>2402 Can not add district (03-ACR) because it already exists</status_message></sos_ack>
<?xml version="1.0" encoding="UTF-8"?>
<sos_ack><tran_id>10304061500002957</tran_id><tran_status>N</tran_status><status_message>2402 Can not add district (03-AAH) because it already exists</status_message></sos_ack>
Is There A File?
Yes - Open It
For Each XML Fragment In File
Load Fragment
Load Schema
Validate Schema
Get Next Fragment
Continue While There Are Files -
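The "break the file into fragments" step described above could be sketched roughly as follows. This is only an illustration, not the poster's code: the class and method names (`FragmentSplitter`, `Split`) are hypothetical, it uses `List<string>` (which needs .NET 2.0 or later), and it assumes the literal text `<?xml` appears only at the start of each fragment, never inside element content or as part of another processing instruction.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: splits one feed file into separate XML documents.
public static class FragmentSplitter
{
    // Every fragment in the sample feed starts with an XML declaration.
    private const string Declaration = "<?xml";

    public static List<string> Split(string fileText)
    {
        var fragments = new List<string>();

        // Anything before the first declaration is ignored.
        int start = fileText.IndexOf(Declaration, StringComparison.Ordinal);
        while (start >= 0)
        {
            // Look for the next declaration after the current one.
            int next = fileText.IndexOf(
                Declaration, start + Declaration.Length, StringComparison.Ordinal);

            string piece = next < 0
                ? fileText.Substring(start)
                : fileText.Substring(start, next - start);

            piece = piece.Trim();
            if (piece.Length > 0)
            {
                fragments.Add(piece);
            }
            start = next;
        }
        return fragments;
    }
}
```

Each string returned keeps its own declaration, so it can be handed to a validating reader as a complete document.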
Hi rmessier,
I'm not quite sure why you can't load the 12 schemas once and use them for all subsequent validation. Would that work, maybe?
The load time for the XML schemas could certainly be significant too, in my opinion. It might be worth measuring anyway.
-
skibum, you had the right idea. Putting the schema into the schema collection made a vast difference in overall performance. Even though I still have to create a validating reader each time through the loop, the performance hit is nowhere near as bad as when I was reading the schema in each and every time. Thanks for the nudge in the right direction.
I still wish, though (listening C# XML team??) that the reader had a way of loading a new XML fragment instead of having to create it each time. How about adding a Clear() and Load() method to the Reader?! -
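For anyone landing on this thread later, the arrangement that ended up working (schema loaded once into an XmlSchemaCollection, a new XmlValidatingReader per fragment) might look roughly like the sketch below. It is an illustration under assumptions, not the poster's production code: the class name `FragmentValidator` is made up, the schema is assumed to have no target namespace (matching the `""` passed to `Schemas.Add` in the code posted earlier), any validation event (warnings included) is counted as a failure, and thread safety is not addressed. Later framework versions mark both classes obsolete, hence the pragma.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Schema;

// XmlValidatingReader and XmlSchemaCollection are obsolete in later
// framework versions; this suppresses the resulting compiler warning.
#pragma warning disable 618

// Hypothetical wrapper: loads the schema once, validates many fragments.
public sealed class FragmentValidator
{
    private readonly XmlSchemaCollection schemaCache = new XmlSchemaCollection();

    // Parse and compile the (monolithic) XSD a single time, up front.
    public FragmentValidator(string xsdText)
    {
        using (var reader = new XmlTextReader(new StringReader(xsdText)))
        {
            schemaCache.Add("", reader); // "" = no target namespace (assumption)
        }
    }

    // A new reader is still created per fragment, but the schema is not reloaded.
    public bool IsValid(string fragment)
    {
        bool ok = true;
        var context = new XmlParserContext(null, null, null, XmlSpace.None);
        var vr = new XmlValidatingReader(fragment, XmlNodeType.Document, context);
        try
        {
            vr.ValidationType = ValidationType.Schema;
            vr.Schemas.Add(schemaCache); // hand over the already-loaded schemas
            vr.ValidationEventHandler += (sender, e) => ok = false;
            while (vr.Read()) { }        // reading the document drives validation
        }
        catch (XmlException)
        {
            ok = false;                  // fragment is not well-formed XML
        }
        finally
        {
            vr.Close();
        }
        return ok;
    }
}
```

Usage would be to construct one `FragmentValidator` at startup from the XSD text and call `IsValid` on each fragment inside the loop. Whether this gives the same speedup reported above depends on the schema and the runtime, so it is worth timing on real data.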
I'm glad it improved your results. Please post what kind of increase you are seeing...I'd be interested.
If you feel that the feature you want added to the XmlValidatingReader is something that should be considered by the Microsoft team, you should create a post on the MSDN "LadyBug" system. I saw some recent stats that they are adding some requested features to the 2.0 framework based on suggestions.