Coffeehouse Thread

85 posts

Forum Read Only

This forum has been made read only by the site admins. No new threads or comments can be added.

It is time... Move the filesystem off of disks

  • ManipUni

    Most new computers today are sold with between 4 and 6 GB of RAM, and that is set to rise. Yet most filesystems continue to store their databases on the physical drives, be they SSDs or hard disks. They have very clever caching, but still, more often than not, when a file's metadata is queried the drive has to be powered up and we have to wait several milliseconds.

     

    Yes, your NTFS database CAN reach up to 1 GB on a really large modern drive with ACL permissions set all over the place, but so what? When you have several GB of memory free kicking around, would you not give that up for far more battery time on your laptop and quicker responses across the board?

     

    And before you say a word about data integrity, we already have transactional filesystems to solve that problem...

     

    PS - Keep in mind *DATA* and the filesystem database are distinctly different things (with larger files the database actually gets smaller)

  • Dexter

    When you have several GB of memory free kicking around

     

    But who says that I have several GB of memory free kicking around? Right now my task manager says I have 50 megabytes free (out of 2 GB). Are you perhaps running some stone age operating system that's incapable of putting the available memory to good use? Tongue Out

  • figuerres

    Well, the first question is: does the OS keep a copy in memory already? It seems like the OS might do that and we just don't see that it has.

     

  • CreamFilling512

    Everything on NTFS is a file, including the file tables, which are metafiles, so they get cached just the same.

     

    Edit: Also it's indexed with B+ trees, so lookup is very fast.
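    (A toy sketch of why the B+ tree point matters: lookup over an index is logarithmic rather than linear. Bisection over sorted keys, shown below, has the same complexity class as a B+ tree search; the file names here are made up, and NTFS's real on-disk node format is of course different.)

```python
import bisect

# Toy "file-name index": sorted keys searched by bisection. Lookup is
# O(log n), the same complexity class as a B+ tree search (a real B+
# tree additionally packs keys into disk-page-sized nodes).
names = sorted(f"file{i:05d}.txt" for i in range(10000))

def lookup(name):
    """Return the record number for name, or -1 if it is absent."""
    i = bisect.bisect_left(names, name)
    if i < len(names) and names[i] == name:
        return i
    return -1
```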

  • Bass

    This is already done at least on Linux. Actually ext4 even caches writes, to hilarious effect. Smiley

     

    When a feature called "barriers" is turned off, ext4 will detect when a file is constantly being read from and written to, and do that entirely in RAM. fsync() has no effect on this. This makes databases insanely fast, but at the cost of possible system integrity. Most desktop OSes turn barriers on, and server OSes tend to turn barriers off.

  • CreamFilling512

    Bass said:
    *snip*

    That's bizarre. I mean, usually database servers want to do their own caching and control disk flushing, since they know better than the OS.

  • Bass

    CreamFilling512 said:
    *snip*

    I don't think most actually do, or if they do, they don't do a particularly good job of it. Really, with DBs you expect that if you do an INSERT it will actually happen. So a lot of databases (at least Postgres and SQLite) call fsync after the completion of a simple write operation, which, on a filesystem with barriers enabled, tends to block until the data is actually written out to disk. Which is, of course, slow.

     

    Without barriers, the kernel decides when it feels it is appropriate to actually write the data to disk. This means a lot of file operations (both reads and writes) happen entirely in RAM, and only when the disk is available and it doesn't hamper performance will the kernel persist the contents to disk. This can be as long as 60 seconds after the fsync request was made (or longer?).


    You can fine-tune the balance of performance vs. data security, but the more data security you want, the less performance you are going to get (and vice versa). Just a fact of life, I guess. Smiley
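    (The write-then-fsync pattern described above, sketched with Python's os.fsync on a throwaway temp file; os.fsync is a thin wrapper over the same fsync() syscall, and on a barriers-enabled filesystem it is the call that blocks.)

```python
import os
import tempfile

# Write-then-fsync: the durability pattern databases like Postgres and
# SQLite rely on. fsync() blocks until the kernel has pushed the data
# to the device, which is what makes a committed INSERT survive a crash.
fd, path = tempfile.mkstemp()          # throwaway file for the sketch
try:
    os.write(fd, b"INSERT committed\n")
    os.fsync(fd)                       # block until the data is flushed
finally:
    os.close(fd)

with open(path, "rb") as f:            # read back what was persisted
    data = f.read()
os.remove(path)
```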

  • CreamFilling512

    Bass said:
    *snip*

    Well, I'm talking about the commercial database servers where scaling is necessary. Like Microsoft SQL Server running Hotmail or something. And if you've ever run MSSQL you know that it will consume all the memory on the machine with the out-of-box configuration, because it's doing its own disk caching.

  • Bass

    CreamFilling512 said:
    *snip*

    Perhaps, but commercial databases tend to guarantee some kind of data integrity, which is impossible unless they persist the contents of a transaction. I don't think they would use write caching by default, as it is fundamentally dangerous to this objective.

     

    I don't think a well designed DB would do extensive read caching either. It's something that is more readily done by the kernel. As everyone has been saying, read caching is what most filesystems (including NTFS) do for you for free.

     

    A DB can provide detailed information about its file I/O requirements through an mmap call anyway.
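    (A minimal sketch of the mmap idea: the process maps the file and simply touches memory, and the kernel pages data in and caches it on demand. Python's mmap module wraps the same underlying call; the file contents here are made up.)

```python
import mmap
import os
import tempfile

# Reading through a memory mapping: no explicit read() calls -- the
# process touches memory and the kernel pages file data in (and keeps
# it cached) on demand.
fd, path = tempfile.mkstemp()
os.write(fd, b"hello, page cache")
os.close(fd)

with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        first = m[:5]            # plain memory access; I/O happens via page faults
        found = m.find(b"page")  # searching the mapping, not issuing reads
os.remove(path)
```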

  • Dexter

    Bass said:
    *snip*

    Indeed, transactional databases require non-cached writes (and it's not only about filesystem caching but also about hardware caching).

     

    As for read caching: of course they cache reads, it would be insane not to do it. The filesystem has no magic orb to tell it what exactly to cache, read ahead, discard from cache etc.

  • CreamFilling512

    From what I understand from Windows internals stuff: any time you open a file, regardless of how you do it, it's implemented internally using memory-mapped files. The file gets mapped to some pages in virtual memory, then it gets brought in by on-demand paging and some heuristics to do basic read-ahead. I imagine the paging file and on-demand paging of EXEs/DLLs use the same mechanism. But the I/O APIs, like C's fread() or whatever, never issue I/O requests themselves; you only get I/O requests when touching a memory-mapped file causes a page fault.

  • Bass

    CreamFilling512 said:
    *snip*

    Yeah, Windows has a call similar to mmap, and I assume it also does paging. Smiley

  • Bass

    Dexter said:
    *snip*

    It doesn't, but an OS can accomplish a lot of the intelligent read caching with a paging algorithm, which keeps track of the most commonly read parts of a file. This is easy to implement with the information the kernel gets from mmap and the subsequent use of the mmapped space. Most of the performance heavy lifting can be done by the kernel, which knows more about the characteristics of the persistent store (e.g. the location of the R/W head).

     

    It would be downright stupid for a DB to take all of this into its own hands, but I am sure there are some legacy DBs out there that still do all of this, because they were written during a time when MS-DOS was considered advanced. Smiley

  • CreamFilling512

    Bass said:
    *snip*

    No way, man. The OS has no knowledge of the internal structure of the database. You can certainly get better performance by doing more work. Database servers optimize their layout on the physical disk.

  • Bass

    CreamFilling512 said:
    *snip*

    You can optimize the internal structure of the database such that the OS will optimize reads to the fullest.

     

    It's really no different than optimizing instructions: you have no control over branch prediction and cache usage on an x86 processor, but you can still optimize code for branch prediction and cache by modifying the structure of your program.
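    (A sketch of the layout idea: place the hot records contiguously at the front of the file, so the kernel's sequential readahead covers all of them in one pass. The record keys, sizes, and "hot set" here are entirely made up for illustration.)

```python
import os
import tempfile

# Hypothetical records and access statistics -- made up for illustration.
records = {f"key{i}": bytes(64) for i in range(100)}   # 100 fixed-size records
hot = {"key7", "key42", "key99"}                       # "commonly read" keys

# Layout optimization: write the hot records first, so they sit
# contiguously at the start of the file where one sequential read
# (which kernel readahead handles well) covers all of them.
order = sorted(records, key=lambda k: (k not in hot, k))

fd, path = tempfile.mkstemp()
offsets = {}
with os.fdopen(fd, "wb") as f:
    for k in order:
        offsets[k] = f.tell()
        f.write(records[k])
os.remove(path)

# Every hot record now lives in the first 3 * 64 = 192 bytes.
hot_max = max(offsets[k] for k in hot)
```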

  • Dexter

    Bass said:
    *snip*

    Any application (big enough and complex enough to be worth the effort) can do better than the kernel at caching, because an application will always know better than the kernel what data it needs. The kernel can at best observe the reads and the writes, and some hints passed through the system calls, and do some guesswork based on that. The kernel does not have a time machine to look into the future, but the application might just have one.

     

  • Bass

    Dexter said:
    *snip*

     

    So I don't really agree. I think a program can provide enough hints to the kernel to let the kernel do all the real work, e.g. moving commonly read data to a certain part of a file. Again, similar to x86 optimization.
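    (One concrete hint mechanism along these lines is posix_fadvise, which tells the kernel the expected access pattern for a byte range of a file so its caching and readahead can act on it. It is Unix-only, so the sketch guards for its presence; the file and range are made up.)

```python
import os
import tempfile

# A program can hand the kernel its expected access pattern instead of
# building its own cache. posix_fadvise is Unix-only, so guard for it.
fd, path = tempfile.mkstemp()
os.write(fd, b"x" * 4096)

hinted = False
if hasattr(os, "posix_fadvise"):
    # Hint: we'll read this range soon -- the kernel may prefetch it
    # into the page cache ahead of the actual reads.
    os.posix_fadvise(fd, 0, 4096, os.POSIX_FADV_WILLNEED)
    hinted = True

os.close(fd)
os.remove(path)
```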

  • Dexter

    Bass said:
    *snip*

    Seriously, do you really want/expect a database system to move gigabytes or terabytes of data around just to keep the kernel happy?
