Tech Off Thread

12 posts

Forum Read Only

This forum has been made read only by the site admins. No new threads or comments can be added.

Tough SQL Server XML Challenge

Back to Forum: Tech Off
  • User profile image
    rmessier

    Okay, I have a particulary ugly block of xml that I need to parse out and stick into sql server.
    The challenge is getting this thing into the correct format since it has both element and attribute
    peppered in it. I was just going to use a dataset and read the xml into it. But there are some problems
    with that because the repeating <geo_co> tag in mail and residence address blocks gum up the works.
    Anyone point me in the right direction? I have started to sort it out, but the voting regions return a
    very strange row, this one has me baffled, I spent the better part of a day futzing with it. Any help would
    be appreciated...


    ------------[ Cut Here ]--------------------
    BEGIN
    DECLARE @IDOC INT
    DECLARE @DOC VARCHAR(4000)

    SET @doc = '<?xml version="1.0" encoding="UTF-8"?>
     <boe_vr_init>
     <tran_id>10703111500002423</tran_id>
     <boe_server_id>192.168.1.1</boe_server_id>
     <boe_voter_id>234123523</boe_voter_id>
     <voter>
     <lastname>Gravenstein</lastname>
     <firstname>Rumpi</firstname>
     <middlename>Ohiostate</middlename>
     <name_suffix>Jr.</name_suffix>
     <birth_da>1967-08-13</birth_da>
     <registration_da>2001-08-13</registration_da>
     <registration_type>MAILIN</registration_type>
     <voter_status>ACTIVE</voter_status>
     <private_address_fl>N</private_address_fl>
     <bmv_id>AD132455</bmv_id>
     <ssn>300994430</ssn>
     <ssn4>4430</ssn4>
     <phone>(000) 000-0000</phone>
     <birth_city>Akron</birth_city>
     <birth_state>Ohio</birth_state>
     <birth_country>USA</birth_country>
     </voter>
     <party_affiliation>R</party_affiliation>
     <residence_address>
     <house_number>2020</house_number>
     <secondary_indicator/>
     <street_name>Front</street_name>
     <street_description>ST</street_description>
     <pre_street_direction/>
     <post_street_direction/>
     <secondary_address>Falls Towne Center, Suite 104-8</secondary_address>
     <city>Cuyahoga Falls</city>
     <state>OHIO</state>
     <zip_co>44221</zip_co>
     <geo_co>
     <latitude>3.1415926535897932384626433832795</latitude>
     <longitude>3.1415926535897932384626433832795</longitude>
     </geo_co>
     </residence_address>
     <mail_address>
     <street_name>MAIN</street_name>
     <city>AKRON</city>
     <state>OHIO</state>
     <zip_co>44221</zip_co>
     <geo_co>
     <latitude>3.1415926535897932384626433832795</latitude>
     <longitude>3.1415926535897932384626433832795</longitude>
     </geo_co>
     </mail_address>
     <voting_regions>
     <district_code district_type="CONGDIST">23</district_code>
     <district_code district_type="STATHOUS">34</district_code>
     <district_code district_type="STATSENA">12</district_code>
     </voting_regions>
     <edit_flag>Y</edit_flag>
     </boe_vr_init>
    '


     EXEC sp_xml_preparedocument @IDOC OUTPUT, @DOC
     SELECT *
      FROM OPENXML(@idoc,'/')
      WITH (
        tran_id   varchar(020) '/boe_vr_init'
       ,district_code varchar(20)  '/boe_vr_init/voting_regions'
       ,district_type varchar(20)  '/boe_vr_init/voting_regions/district_type'
       ,latitude  varchar(20)  '/boe_vr_init/mail_address/geo_co'
       ,longitude  varchar(20)  '/boe_vr_init/mail_address/geo_co'
      )
     EXEC sp_xml_removedocument @IDOC
     
    END
    GO

  • User profile image
    rmessier

    I was afraid that this would stump everyone. Here is the situation in case anyone replies with an overly simplistic answer. Unfortunately, this XML is being genarated by a state voting system so I have no way of asking them to modify it, what you see if what I am getting and what I am expected to interface to. At this point I am thinking that I am going to have to parse this mess out line by line without being able to use an XML document to parse it. If anyone has any brilliant advice I'd love to hear it. But, so far DataSet.ReadXML() is failing as well as OPENXML. So, I am faced with quiet the conundrum. What have otherse in this situation done? You are faced with a butchered XML file that the provider will not or cannot modify, that is close enough to real XML to be flustering but not valid enough to be read as valid XML. ideas people? Or is this one just too tough for everyone? 

  • User profile image
    jj5

    Erm, who is Rumpi Gravenstein and how do they feel about you posting this information to this forum?

    John.

  • User profile image
    rmessier

    Rumpi is a test name, all the data is for testing alone and is a publicly published standard for the state that is requiring this interface.   

  • User profile image
    UdoSchroeter

    I'm not sure if I got that right, but here
      ,latitude  varchar(20)  '/boe_vr_init/mail_address/geo_co'
      ,longitude  varchar(20)  '/boe_vr_init/mail_address/geo_co'

    you're actually selecting the XML node geo_co instead of the latitude and longitude?

    So, setting the correct XPath to
      /boe_vr_init/mail_address/geo_co/latitude
      /boe_vr_init/mail_address/geo_co/longitude
    respectively might help.

    On the other hand, I don't know * about SQL Server so, this post may be a troll. Anyway, just thinking...

    Update: Oops, my bad. Forget this post.

  • User profile image
    hurcane

    I can't help much with the XML, but it's interesting to see test data that isn't all fake. The addresses listed in Cuyahoga Falls is legitimate. I know because I've been there (not to the suite, but to the office building).

  • User profile image
    rmessier

    Indeed! The address exists, that is interesting. The specification is public knowledge so I doubt that the publisher of it would put sensitive data into it. Just a short background. This is a voter registration system that is being handled by a mid western state. They require that the counties within that state interface to their main voter registration system. The rub is that the state's system is Oracle (YUK!) and the XML that they give you is butt ugly. The communication layer between the state and the counties is all done via XML packets. The state sends a county a packet and you respond back with a  packet. What is ugly about it is that withing the spec the Oracle produced XML does not conform to the common XML specs. For instance, if they send more than 1 voter in an XML file it will actually look like this!!!!!  YES, I am not lying! The actually will place a new <?xml> tag at the top of each new row in the file instead of wrapping it in a root tag. For the life of me I cannot figure out why they (or Oracle) would do this, it boggles the mind.

    -----------[ XML voter registration packet ]--------
    <?xml version="1.0" encoding="UTF-8"?>
     <boe_vr_init>
     <tran_id>10703111500002423</tran_id>
     <boe_server_id>192.168.1.1</boe_server_id>
     <boe_voter_id>234123523</boe_voter_id>
     <voter>
     <lastname>Gravenstein</lastname>
     <firstname>Rumpi</firstname>
    </boe_vr_init>
    <?xml version="1.0" encoding="UTF-8"?>
     <boe_vr_init>
     <tran_id>10703111500002424</tran_id>
     <boe_server_id>192.168.1.1</boe_server_id>
     <boe_voter_id>234123524</boe_voter_id>
     <voter>
     <lastname>Messier</lastname>
     <firstname>Ray</firstname>
    </boe_vr_init>
    ------------[ end of line ]----------

  • User profile image
    spod

    It sounds pretty nasty. i'm not sure if any of our xml stack will cope with the multiple <?xml tags.

    Can you process it in the middle tier maybe?
    Use string / regex stuff to pull out the fragments and then process them into some sql friendly strucure ( generate a batch of inserts maybe? ) using an xmlreader over the fragments?


  • User profile image
    gmiley

    It could get kinda chunky depending upon the filesize, but you could store the data as a string and

    string.split("<?xml version=""1.0"" encoding=""UTF-8""?>")

    Or string.replace("<?xml version=""1.0"" encoding=""UTF-8""?>" & vbCrLf, "")

    That is, if you know it is coming from a windows machine. Otherwise, you would need to find out how they do their CRs or just ommit the & vbCrLf since when you read in the xml it doesnt care about white space.

  • User profile image
    siri

    My experience with XSLT is that if you can understand it and it is at least Formatted correctly (tag hierarchy), you can write a declaritive StyleSheet to transform it into just about anything you want.  Depending on the complexity, this might not be better than a procedural analysis of the XML.

    Of course providing multiple voters as multiple XML blocks is totally stupid, and requires that you write at least SOME procedural code to break it up.

    I would suggest creating a format which is your objective (ie. one that will work), and then it should be pretty clear what kind of transformation needs to take place.  If it is not complicated, then it should be pretty easy to write an XSLT to take care of it, especially if you just need to change a couple of tag names.

    Regardless, it sounds like you won't be able to get away with directly importing it into SQL Server.

    siri

    ps. I didn't actually have time to examine the structure of the XML.  If you'd like me to make a more specific suggestion, let me know that you are interested in this kind of solution and what kind of changes need to take place.

  • User profile image
    siri

    That's odd.  I don't see anything wrong with the XML you posted.  Sure 'geo_co' is used twice, but that should be fine.  It means the same thing in both cases, simply with a different context.  I have little experience dealing with XML in SQL Server, but I'd have to say that if it can't parse this it is SQL Server's problem...

    Actually, I take that back...I could see a problem with importing this directly in that this XML block is really several table joined together and then seperated structurally in the XML.  Still...SQL Server should be able to parse this!

  • User profile image
    onovotny

    Forgive me if this is a dumb idea, but does the state have any published schema for this document -- can you just get the XSD and use that to generate appropriate classes to read the data?

    Alternatively, you can try using the XSD.exe tool in the .NET framework to have it guess at a schema (the exact parameter escapes me ATM, but it's there). 

    Since it looks like you'll need to do some middle-tier processing anyway (to break-up the multiple documents in the stream), can you load up each document into an XPathDocument and parse out the needed info?  You could then use your favorite ADO.Net technique to load the data...

    Regards,
    --Oren

Conversation locked

This conversation has been locked by the site admins. No new comments can be made.