Tech Off Thread

File associations and unknown file extensions

  • MrMilney

    I've stumbled across some interesting behavior in Windows (and Word) that I would love to get some detailed information on. Try this experiment:

    1. Open a new, blank document in Word
    2. Type in some text, then save the file as test.doc
    3. Close Word
    4. Rename the file, changing the file extension to something that is not registered to any known application (for example, test.doc -> test.foo)
    5. Double-click the file

    Word launches and opens the file just like it was still named test.doc.

    Now, try this:

    1. Open a new, blank document in Notepad
    2. Type in some text, then save the file as test2.txt
    3. Close Notepad
    4. Rename the file using the same extension you did a moment ago (for example, test2.txt -> test2.foo)
    5. Double-click the test2.foo file

    Windows launches a dialog asking you to choose a program to open the unknown file type.

    Obviously there is some metadata in the Word doc that Windows can use to determine that the file should be opened in Word, but can anyone shed some more light on this? I'd be curious to know any technical details that you may be able to share.

  • cheong

    I believe an Alternate Data Stream (ADS) could be involved, but I'm not sure because MS Office isn't installed on my home machine.

    But you can use the method provided in the KB article to dump the ADS out and see if any relevant bits are there.

  • MrMilney

    Unfortunately, that doesn't seem to be it. I've opened the file in FlexHEX which is a streams-aware hex editor and there are no alternate data streams in the original Word doc (just to be safe, I created a text file with alternate streams and, sure enough, FlexHEX saw and displayed them).

    Thanks for the input though, anyone else want to take a stab at this?

  • CannotResolveSymbol

    MrMilney wrote:
    Unfortunately, that doesn't seem to be it. I've opened the file in FlexHEX which is a streams-aware hex editor and there are no alternate data streams in the original Word doc (just to be safe, I created a text file with alternate streams and, sure enough, FlexHEX saw and displayed them).

    Thanks for the input though, anyone else want to take a stab at this?


    It's definitely not anything having to do with NTFS...  I can copy a Word document over to a FAT32 drive and the behavior persists.

    From what I can tell, Windows is identifying the file type from its content.  If you write over it (i.e. echo "Hi" > test.filetype), you get the "Open with" box again.  HTML documents created by Word operate the same way:  if you delete one particular line, it loses its magic Word icon.

  • Sven Groot

    CannotResolveSymbol wrote:
    HTML documents created by Word operate the same way:  if you delete one particular line, it loses its magic Word icon.

    But that works because Office installs a shell icon handler for html files; it does not work on arbitrary file types.

  • stevo_

    I think this is as simple as programs being able to install filters into the shell (or the shell's file association letting you associate on more than just extensions). Word documents will carry a standard header, i.e. first 64 bytes = 'this' means the file is very likely a Word document. It would be nice if files moved away from being extension driven. I have had clients using Macintosh/PC environments who uploaded files to our web administration consoles, only to be told the file extension couldn't be found. We spent quite a lot of time working up potential byte sniffers that would identify patterns (such as 'header regions'), but the more we planned it out, the more work it appeared to be, for a service that really could not provide truly reliable feedback.
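
For anyone wondering what such a byte sniffer might look like, here is a minimal Python sketch. The signature table and the `sniff` name are my own illustrative assumptions, not anything from the consoles described above, and a matching prefix only suggests a type; it doesn't prove it.

```python
# Illustrative table of well-known leading byte signatures ("magic numbers").
# The table contents and the function name are assumptions for this sketch.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"%PDF-", "PDF document"),
    (b"PK\x03\x04", "ZIP archive"),
    (b"\xff\xd8\xff", "JPEG image"),
    (b"GIF89a", "GIF image"),
]


def sniff(data: bytes):
    """Return a best-guess label for the leading bytes, or None if unknown."""
    for magic, label in SIGNATURES:
        if data.startswith(magic):
            return label
    return None


print(sniff(b"%PDF-1.4\n..."))    # PDF document
print(sniff(b"just some text"))   # None
```

Plain text has no signature at all, which is one reason a sniffer like this can only ever be a hint.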

  • RichardRudek

    I did a quick trace and found that Explorer was opening the FileExts registry key, which led to [this document] on MS.

    As for how an entry for .foo appeared there, I would need to look into it further. But I suspect that happens when you do the file rename in (Windows) Explorer.

  • MrMilney

    CannotResolveSymbol wrote:
    
    if you delete one particular line, it loses its magic Word icon.

    You're right, if you save a Word doc as a web page then remove the line that reads

    <meta name=Generator content="Microsoft Word 12 (filtered)">

    the file magically becomes a "plain" HTML document and all association with Word is gone (it loses the special icon and it no longer opens for editing inside Word). Put that line back, and it works again. Obviously, something similar is happening in the standard Word doc file too.
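
To make the experiment repeatable, here is a small Python sketch that checks a page for that Generator line. It only illustrates a content-based check; I am not claiming this is how the Office shell handler does it, and the `GeneratorFinder` and `word_generated` names are made up for the example.

```python
from html.parser import HTMLParser


class GeneratorFinder(HTMLParser):
    """Collects the content of every <meta name="Generator" ...> tag."""

    def __init__(self):
        super().__init__()
        self.generators = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute names arrive lowercased; values may be None.
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "generator":
            self.generators.append(attr.get("content", ""))


def word_generated(html: str) -> bool:
    """True if some Generator meta tag mentions Microsoft Word."""
    finder = GeneratorFinder()
    finder.feed(html)
    return any("microsoft word" in g.lower() for g in finder.generators)


page = '<html><head><meta name=Generator content="Microsoft Word 12 (filtered)"></head></html>'
print(word_generated(page))                                   # True
print(word_generated("<html><head></head><body></body></html>"))  # False
```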

    I ran into this when I was giving a demonstration to my students last week about how Windows uses the file extension to know what the file's type is. When I demonstrated using a Word doc, it all fell apart: no matter what I did, the file opened in Word! Other file types (.txt, .jpg) worked as I expected them to, but Word kept a hold on any file that it created. It was an ... interesting ... experience.

  • MrMilney

    stevo_ wrote:
    I think this is just as simple as programs can install filters into the shell (or the shell file assoc lets you associate with more than just extensions), the word documents will carry a standard header, ie- first 64 bytes = 'this' means this is very likely to be a word document.

    Agreed, there's something in the file itself, but not something that is immediately obvious. I poked around inside the file with a hex editor and nothing jumped out at me, in any case. I would have liked to be able to go back to class and show my students some string or obvious piece of metadata that identified this file as a Word doc regardless of the extension.

    stevo_ wrote:
    It would be nice if files moved away from being extension driven. I have had clients using Macintosh/PC environments that have uploaded files to our web administration consoles, only to be told the file extension couldn't be found.

    I agree, the three letter extension dependency is fragile at best, not to mention so very last century. It's the price we pay for backward compatibility, for better or for worse.

  • MrMilney

    RichardRudek wrote:
    I did a quick trace and found that Explorer was opening the FileExts registry key, which led to [this document] on MS.

    As for how an entry for .foo appeared there, I would need to look into it further. But I suspect that happens when you do the file rename in (Windows) Explorer.

    Thanks for the find, it was interesting reading. However, it didn't really get to the heart of my particular problem.

    Even though I used *.foo in my example above, that wasn't what I actually changed the extension to on the computer in my classroom (the file was named "directions to grandmas.doc", which I then changed to "directions to grandmas.house", trying to be clever). As we all know, Word still opened the file. So I threw the file on my thumb drive and brought it home to examine. When I double-clicked it at home, it still opened in Word even though the creation and renaming happened on a computer miles away. Just to be safe, I checked the registry on my computer and there is no entry for *.house files (or *.foo for that matter), so it isn't the case that either Word or Windows installed a handler for the new file type when I changed the extension.

  • CannotResolveSymbol

    MrMilney wrote:
    

    stevo_ wrote: I think this is just as simple as programs can install filters into the shell (or the shell file assoc lets you associate with more than just extensions), the word documents will carry a standard header, ie- first 64 bytes = 'this' means this is very likely to be a word document.

    Agreed, there's something in the file itself, but not something that is immediately obvious. I poked around inside the file with a hex editor and nothing jumped out at me, in any case. I would have liked to be able to go back to class and show my students some string or obvious piece of metadata that identified this file as a Word doc regardless of the extension.



    I did some poking around and it appears that Word documents have a common header of (hex) d0 cf 11 e0 a1 b1 1a e1.  If any of these bytes are changed, Windows no longer recognizes the file as being a Word document.  I'm not sure how this is associated with Word in Windows, though, unless it's actually hard-coded into Windows itself.  Glancing through the registry doesn't seem to help.
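
The byte-changing experiment can be reproduced without Word with a short Python sketch. As far as I know, these eight bytes are the signature of the OLE compound file container that legacy Office formats are stored in, so the check really says "compound file" rather than "Word" specifically. The helper names and the zero-filled sample buffer are assumptions made up for the example.

```python
# The eight header bytes reported above.
OLE_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")


def has_ole_header(data: bytes) -> bool:
    """True if the buffer starts with the eight signature bytes."""
    return data[:8] == OLE_MAGIC


def flipped(data: bytes, index: int) -> bytes:
    """Return a copy of data with one byte inverted (hypothetical helper)."""
    changed = bytearray(data)
    changed[index] ^= 0xFF
    return bytes(changed)


# Stand-in for the first 512 bytes of a .doc file (not real document data).
sample = OLE_MAGIC + bytes(504)

print(has_ole_header(sample))                                      # True
print(any(has_ole_header(flipped(sample, i)) for i in range(8)))   # False
```

Changing any one of the eight bytes breaks the match, which lines up with what I saw in Explorer.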

    MrMilney wrote:

    stevo_ wrote: It would be nice if files moved away from being extension driven. I have had clients using Macintosh/PC environments that have uploaded files to our web administration consoles, only to be told the file extension couldn't be found.

    I agree, the three letter extension dependency is fragile at best, not to mention so very last century. It's the price we pay for backward compatibility, for better or for worse.



    Unfortunately, there's no other reliable way to keep file type data.  If you use alternate data streams to provide file type data like Mac OS used to in its resource fork, you'll lose the file type every time you put the file on a volume using a filesystem like FAT that doesn't support ADS's.  And if you try to guess the file type from the data contained in the file, you can only guess:  there's nothing stopping a program I create from generating a binary file beginning with d0 cf 11 e0 a1 b1 1a e1, for example.
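
To show how little it takes to fool a header check, here is a minimal Python sketch that writes such an impostor file and then tests it. The file name, the payload, and the function names are all hypothetical.

```python
import os
import tempfile

OLE_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")


def starts_with_ole_magic(path: str) -> bool:
    """Content-only check: compares the first eight bytes of the file."""
    with open(path, "rb") as f:
        return f.read(len(OLE_MAGIC)) == OLE_MAGIC


def write_impostor(path: str) -> None:
    """Write the header bytes followed by data that is not a document."""
    with open(path, "wb") as f:
        f.write(OLE_MAGIC + b"nothing resembling a Word document follows")


with tempfile.TemporaryDirectory() as tmp:
    impostor = os.path.join(tmp, "impostor.foo")  # hypothetical file name
    write_impostor(impostor)
    # The header check passes even though the payload is not a real document.
    print(starts_with_ole_magic(impostor))  # True
```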

  • RichardRudek

    MrMilney wrote:
    

    Thanks for the find, it was interesting reading. However, it didn't really get to the heart of my particular problem.

    Even though I used *.foo in my example above, that wasn't what I actually changed the extension to on the computer in my classroom (the file was named "directions to grandmas.doc", which I then changed to "directions to grandmas.house", trying to be clever). As we all know, Word still opened the file. So I threw the file on my thumb drive and brought it home to examine. When I double-clicked it at home, it still opened in Word even though the creation and renaming happened on a computer miles away. Just to be safe, I checked the registry on my computer and there is no entry for *.house files (or *.foo for that matter), so it isn't the case that either Word or Windows installed a handler for the new file type when I changed the extension.

    OK, I didn't look far enough into the trace - I stopped when I saw the FileExts registry lookup, and when I checked my registry, it had Word set up as the handler for it. I then searched MS and found the document that I linked.

    Upon further examination (10 mins), I see that Explorer does open the file and read in the first 512 bytes. I'm certain .foo wasn't registered beforehand, in which case something set it, possibly the code that follows the examination of the file contents. Here's an excerpt:

    Process		Process ID	Win32 API		Parameters								Return Value	Status		GetLastError

    explorer 0x5B0 GetFileAttributesW lpFileName:0xEBA1C "D:\My Documents\test.foo" 0x20 SUCCESS 0
    explorer 0x5B0 HeapAlloc hHeap:0x90000, dwFlags:0x8, dwBytes:0x30 0x1B20518 SUCCESS 0
    explorer 0x5B0 HeapFree hHeap:0x90000, dwFlags:0x0, lpMem:0x1B20518 0x17BF01 SUCCESS 0
    explorer 0x5B0 CreateFileW lpFileName:0xEBA1C "D:\My Documents\test.foo",
     dwDesiredAccess:0x80000100 = GENERIC_READ | SPECIFIC_RIGHTS_ALL,
     dwShareMode:0x3 = FILE_SHARE_WRITE | FILE_SHARE_READ,
     lpSecurityAttributes:0x0,
     dwCreationDisposition:0x3 = OPEN_EXISTING,
     dwFlagsAndAttributes:0x0,
     hTemplateFile:0x0 0x4FC SUCCESS 0
    explorer 0x5B0 HeapAlloc hHeap:0x90000, dwFlags:0x8, dwBytes:0x30 0x1B20518 SUCCESS 0
    explorer 0x5B0 HeapFree hHeap:0x90000, dwFlags:0x0, lpMem:0x1B20518 0x17BF01 SUCCESS 0
    explorer 0x5B0 SetFilePointer hFile:0x4FC,
     lDistanceToMove:0x0,
     lpDistanceToMoveHigh:0x0,
     dwMoveMethod:0x0 0x0 SUCCESS 0
    explorer 0x5B0 ReadFile hFile:0x4FC, lpBuffer:0x12CEB14 <D0CF11E0A1B11AE100000000000000>,
     nNumberOfBytesToRead:0x200,
     lpNumberOfBytesRead:0x12CEB08,
     lpOverlapped:0x0 0x1 SUCCESS 0

    Geez, that was a PITA to try and format!!

    If you're interested, I can give a copy of the trace to you. You'll need to download [APIMON], though.
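
For those who can't run APIMON, the read pattern in the excerpt (open the existing file read-only, seek to offset 0, read 0x200 bytes) can be mimicked with a rough Python sketch. This only imitates the sequence of calls; it is not Explorer's code, and the function names and the zero-filled stand-in file are assumptions.

```python
import os
import tempfile


def read_head(path: str, size: int = 0x200) -> bytes:
    """Read up to `size` bytes from the start of the file (0x200 = 512)."""
    with open(path, "rb") as f:   # read-only open of an existing file
        f.seek(0)                 # offset 0 from the beginning of the file
        return f.read(size)


def preview(head: bytes, count: int = 15) -> str:
    """Upper-case hex of the first few bytes, like the lpBuffer preview."""
    return head[:count].hex().upper()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.foo")
    with open(path, "wb") as f:
        # Stand-in content: the eight header bytes followed by zeros.
        f.write(bytes.fromhex("d0cf11e0a1b11ae1") + bytes(1016))
    head = read_head(path)
    print(len(head), preview(head))
```

Only the first 512 bytes are read here no matter how large the file is.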

  • CannotResolveSymbol

    Does Explorer read anything from the registry or from disk after the ReadFile but before it starts Word?  I'd try it myself but my school's filter blocks APIMonitor (it's "Hacking").

  • RichardRudek

    CannotResolveSymbol wrote:
    Does Explorer read anything from the registry or from disk after the ReadFile but before it starts Word?  I'd try it myself but my school's filter blocks APIMonitor (it's "Hacking").


    Oh sh!t yeah. That's why I'm happy to give away the trace...

    I'm suggesting people download APIMON just so they can open and view the trace directly. I've printed it out (from APIMON) and converted it to PDF, but all the colours are gone, and it looks like you lose some details. It's printing to a colour PostScript driver set to write to file, and I normally have no problem when I print colour content like this, so it must be an APIMON limitation.

    Is the sandbox an appropriate place to put this stuff? I've not done this before. If I have to register or something, then I'll try doing it tonight, if there's enough interest.

  • figuerres

    RichardRudek wrote:
    
            dwMoveMethod:0x0
    0x0 SUCCESS 0
    explorer 0x5B0 ReadFile hFile:0x4FC, lpBuffer:0x12CEB14 <D0CF11E0A1B11AE100000000000000>,
     nNumberOfBytesToRead:0x200,
     lpNumberOfBytesRead:0x12CEB08,
     lpOverlapped:0x0 0x1 SUCCESS 0




    Is it just me or is that:


    <D0C F11E 0A1B11AE100000000000000>,

    "DOCument File"  ??  :)

  • CannotResolveSymbol

    figuerres wrote:
    
    RichardRudek wrote: 
            dwMoveMethod:0x0
    0x0 SUCCESS 0
    explorer 0x5B0 ReadFile hFile:0x4FC, lpBuffer:0x12CEB14 <D0CF11E0A1B11AE100000000000000>,
     nNumberOfBytesToRead:0x200,
     lpNumberOfBytesRead:0x12CEB08,
     lpOverlapped:0x0 0x1 SUCCESS 0




    Is it just me or is that:


    <D0C F11E 0A1B11AE100000000000000>,

    "DOCument File"  ??  :)


    I didn't catch that when I first posted it, but I noticed it when Richard posted it again...  Leetspeak by the MS Office team ;)

  • benn23uk

    Presumably this also explains XML files. If you have an ordinary XML file from somewhere random, it'll probably open in IE, but if you have one that was created by saving an Excel spreadsheet as XML, then it opens in Excel again. Or at least does on my machine...
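
If anyone wants to compare the two kinds of XML file, here is a small Python sketch. It assumes (this is my assumption, not something established in this thread) that the Excel-saved file carries an `mso-application` processing instruction near the top naming a progid such as `Excel.Sheet`; the function name is made up.

```python
import re

# Assumption: Office-saved XML has a processing instruction of this shape.
MSO_PI = re.compile(r'<\?mso-application\s+progid="([^"]+)"\s*\?>')


def office_progid(xml_text: str):
    """Return the progid named in an mso-application instruction, or None."""
    match = MSO_PI.search(xml_text[:1024])  # only look at the start of the file
    return match.group(1) if match else None


excel_like = '<?xml version="1.0"?>\n<?mso-application progid="Excel.Sheet"?>\n<Workbook/>'
print(office_progid(excel_like))                        # Excel.Sheet
print(office_progid('<?xml version="1.0"?><root/>'))    # None
```

If the instruction is there in one file and missing in the other, that would be the XML equivalent of the Generator line in Word's HTML.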

  • MrMilney

    RichardRudek wrote:
    

    Geez, that was a PITA to try and format!!

    If you're interested, I can give a copy of the trace to you. You'll need to download [APIMON], though.

    Thanks for all your time and effort!

    I would be very interested in a copy of the trace in whatever format you could muster.
