Coffeehouse Thread

9 posts

Minimal Storage vs Redundancy

  • magicalclick

    In a recent debate, the notion of "minimal storage vs redundancy" came to my attention. Neither can be ignored, and it is quite hard to keep the two in balance.

    The simple way to put it: normally, in our programs, we use pointers to refer to objects. If the structure is a tree (each object is referenced only once) and for some sad reason one object in a branch is corrupted, its sub-branches are dead. Simply put, cut a branch and everything attached to it falls to the ground. That is not robust, yet we often write our software this way. We don't expect an object to get corrupted, because modern hardware can detect faulty bits and correct them.
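
    A minimal sketch of that fragility, assuming a toy `Node` class and a hypothetical `corrupted` flag standing in for real corruption. Every node is referenced exactly once, so losing one node loses its whole subtree:

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    children: list["Node"] = field(default_factory=list)
    corrupted: bool = False  # hypothetical flag standing in for real corruption


def reachable(node: Node) -> list[str]:
    """Collect the names of every node we can still walk to from `node`."""
    if node.corrupted:
        return []  # a corrupted node hides itself and everything beneath it
    names = [node.name]
    for child in node.children:
        names.extend(reachable(child))
    return names


# root -> a -> a1, root -> b -> b1; each node has exactly one parent reference.
branch_a = Node("a", [Node("a1")])
branch_b = Node("b", [Node("b1")])
root = Node("root", [branch_a, branch_b])

before = reachable(root)
branch_a.corrupted = True  # "cut" one branch
after = reachable(root)
```

    Here `a1` itself is intact, but nothing points to it except the corrupted `a`, so it drops out of `after` along with its parent.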

    But that doesn't work in a world built on persistent communication. The best example is the hyperlink. If one website in a chain of hyperlinks is gone, the entire sub-chain is broken. The same happens with an old file system when a directory is partially corrupted: the entire chain of subfolders becomes unreachable.

    In order to be fault tolerant, the data has to be duplicated. For example, instead of my page being reachable through a single website's reference, the listing is duplicated across multiple servers (multiple search engines, not one). The data is duplicated across companies, in addition to each company's internal fault tolerance. A single article could easily be cached/duplicated 100 times for search engines alone. It is quite scary to look at, but feasible when multiple companies are involved.

    I am bringing this up because the recent debate involved a simple notion of "minimal tagging vs tag spamming", or "minimal referral vs reference spamming".

    Ideally, I simply want to tag/refer my review of a book called "Obama Biography Vol1" with a single tag value, "amazon.com bookID:1458". With minimal tagging, that's all I need. There is a simple, systematic and feasible way to crowd source different domains. With such a system, I can type "Obama", traverse citations/track-backs automatically, and see my review in the results. The relationship between Obama and "amazon.com bookID:1458" can be inferred using unified query communication between the different domains. But the problem is, if amazon.com ever disappears, my tag becomes junk.

    In the end, it is better to spam the tag values even though many of them could be inferred. Once the tags are spammed, they will be crawled and indexed by different companies.

    Another big difference is internal vs external. The minimal way is not fault tolerant, but internally it can be fixed at a smaller, local scale, which is not entirely bad. Externally, when a domain disappears, we can't do anything about it except hope another company has crawled our generic tag-spam data.

    Got to type it out, so I can sleep tonight. This is not healthy indeed.

    Leaving WM on 5/2018 if no apps, no dedicated billboards where I drive, no Store name.
  • spivonious

    So how about instead of "amazon.com bookID:1458 Obama Biography Vol1", you say "ISBN:1234567890"? Multiple systems know about ISBNs, so if amazon.com disappears, it could seamlessly switch to something else that knows ISBNs.

    All that's needed is a way for domains to communicate their capabilities. Something like the Windows 8 app contracts but cross-platform and distributed.

    Minimal tagging can be successful and fault-tolerant with the right system.
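
    A rough sketch of what "seamlessly switch to something else that knows ISBNs" could look like. The resolvers here are hypothetical in-memory stand-ins for separate domains, not a real API, and the ISBN and title are just the placeholder values from this thread:

```python
from typing import Callable, Optional

Resolver = Callable[[str], Optional[dict]]


def make_resolver(catalog: dict[str, dict], online: bool = True) -> Resolver:
    """Build a hypothetical stand-in for one domain that understands ISBNs."""
    def resolve(isbn: str) -> Optional[dict]:
        if not online:
            raise ConnectionError("domain is gone")
        return catalog.get(isbn)
    return resolve


def lookup(isbn: str, resolvers: list[Resolver]) -> Optional[dict]:
    """Ask each domain in turn; skip the ones that have disappeared."""
    for resolve in resolvers:
        try:
            record = resolve(isbn)
        except ConnectionError:
            continue  # this domain is unreachable, try the next one
        if record is not None:
            return record
    return None


book = {"title": "Obama Biography Vol1"}
resolvers = [
    make_resolver({"1234567890": book}, online=False),  # the store that vanished
    make_resolver({"1234567890": book}),                # another domain that knows ISBNs
]
found = lookup("1234567890", resolvers)
```

    The tag stays minimal (just the ISBN); the fault tolerance comes from more than one domain being able to answer for the same ID.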

  • W3bbo

    With respect to "internal state" of a program, especially in native code, I make my programs fail (i.e. terminate) as soon as any internal consistency errors are detected - if anything is corrupted it means the program is now unreliable, and in many cases data loss is preferable to data corruption (i.e. better to lose what was in RAM, than to potentially get HDD data overwritten).
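
    As an illustration only (Python rather than native code, with a hypothetical `Ledger` class and a list standing in for the disk), the fail-fast idea looks roughly like this. A real program would terminate at the failed check; the `try`/`except` is only here so the example can show what happened:

```python
class InconsistentState(RuntimeError):
    """Raised when an internal invariant no longer holds."""


class Ledger:
    def __init__(self, entries):
        self.entries = list(entries)
        self.total = sum(self.entries)  # cached value, must always equal sum(entries)

    def check(self):
        if self.total != sum(self.entries):
            raise InconsistentState("cached total does not match entries")

    def save(self, disk):
        self.check()  # refuse to persist anything once the state is suspect
        disk.append((tuple(self.entries), self.total))


disk = []  # stand-in for persistent storage
ledger = Ledger([10, 20])
ledger.save(disk)

ledger.total = 999  # simulate corruption of the in-memory state
try:
    ledger.save(disk)
    stopped = False
except InconsistentState:
    stopped = True  # the bad state never reached "disk"
```

    What was in memory is lost, but the data already on `disk` is still the last known-good copy.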

    Integrity due to environmental factors is the responsibility of hardware, not software. The only redundancy in a system should be in case of hardware failure (e.g. RAID or PSUs).

    As for redundancy in data, I actively avoid it. Even if the rest of the system is correct, you end up with opportunities for new bugs, because you have to ensure that every redundant piece of data is updated atomically. In practice you'll find that the only consistency errors are those introduced by your own bugs.

    My own pet peeve is with RFC 1123 date formatting: it adds the day name to the date string, which adds redundancy that requires complicated validation schemes (too many people don't use built-in date/time libraries).
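
    To make the redundancy concrete, here is a small sketch using Python's standard `email.utils.parsedate_to_datetime`. The helper name `day_name_matches` is made up for this example, and it compares the day name itself rather than assuming the parser cross-checks it:

```python
from email.utils import parsedate_to_datetime

# datetime.weekday() numbers Monday as 0 and Sunday as 6.
DAY_NAMES = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def day_name_matches(date_string: str) -> bool:
    """Check that the leading day name agrees with the actual calendar date."""
    claimed = date_string.split(",", 1)[0].strip()
    parsed = parsedate_to_datetime(date_string)
    return DAY_NAMES[parsed.weekday()] == claimed


good = day_name_matches("Sun, 06 Nov 1994 08:49:37 GMT")
bad = day_name_matches("Mon, 06 Nov 1994 08:49:37 GMT")  # same date, wrong day name
```

    The date alone already determines the weekday, so the extra field can only agree with it or contradict it.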

  • evildictaitor

    ++. Your entire system is built with error-detection and error-correction in mind - your RAM fails all the time and Windows corrects for it. Your disk will write bad sectors out to disk and your OS will re-write them. Your processor will incorrectly calculate things on the ALUs but the errors are detected and the operations re-submitted. Even your entire NTFS partition is built so that if the system bluescreens when you're writing a file out to disk you don't lose your entire partition.

    The entirety of your system is built around a simple premise: Ring3 programs should "just work" and are conceptually easy to understand. That's why ring3 programs have linear memory that just works, why you have threads that don't stomp on each other's stacks, why you can "load libraries" and "start processes" and think in terms of files and folders instead of having to get your head around disk sectors.

    Now, since an awful lot of PhDs, professors and senior engineers have gone to so much effort to ensure that your system will never go wrong, we can only assume that errors are either:

    a) Too catastrophic to guard against (e.g. your building burns down). This requires backups at another physical location, and cannot be guarded against in software.

    b) The errors are due to bugs in your software or foolish human errors - and these errors are probably best avoided by good program design rather than having awfully clever software which detects and corrects these errors - and anyway, if your primary data is broken, who's to say your backups are any better?

    Surely the only defence to (b) is to build your software to be as simple as possible and to stop and alert a developer whenever something "strange" happens rather than trying to second-guess what might (and probably won't) ever happen in real life. There are enough weird side-cases that are poorly tested causing stability and security issues in modern software without adding your own. If your program comes up against an inconsistency then stop - at best you might just be halting before you make it worse (and often you detect errors relatively soon after you caused them, so it'll be easier to debug) - at worst you might have just stopped a hacker who made your system inconsistent in order to exploit it by going down a code-path that you never properly tested or is too clever for its own good.

  • magicalclick

    spivonious wrote:

    So how about instead of "amazon.com bookID:1458 Obama Biography Vol1", you say "ISBN:1234567890"? Multiple systems know about ISBNs, so if amazon.com disappears, it could seamlessly switch to something else that knows ISBNs.

    All that's needed is a way for domains to communicate their capabilities. Something like the Windows 8 app contracts but cross-platform and distributed.

    Minimal tagging can be successful and fault-tolerant with the right system.

    Yeah, a global ID (GID) should solve this issue, because multiple book stores use the same ID, which takes away the local dependency. But it is very hard to enforce that completely, especially when it comes to giving a GID to my pet puppy. And who actually creates the GID for Obama is a big question. Or the GID of an immigrant, because the government is a local domain, not a global one.

    That's why my tag is "amazon.com localProductID": I have to consider the case where a GID is impossible to enforce. But yeah, by specifying the domain, the tag loses generality and becomes dependent on that domain.

  • magicalclick

    @W3bbo:

    Stopping the server to fix the data is the last thing I want to do, IMO. It is far better to use transactions and let the bad operation be rejected gracefully, then fix the problem without interfering with operations. If I have to stop the server, that is very, very bad.
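
    A minimal sketch of "reject gracefully" using Python's built-in `sqlite3` with an in-memory database. The `tags` table, its CHECK constraint and the `add_tags` helper are assumptions made up for this example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tags ("
    " review_id INTEGER NOT NULL,"
    " tag TEXT NOT NULL CHECK (tag <> ''))"
)


def add_tags(conn: sqlite3.Connection, review_id: int, tags: list[str]) -> bool:
    """Insert all tags or none of them; report failure instead of crashing."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            for tag in tags:
                conn.execute("INSERT INTO tags VALUES (?, ?)", (review_id, tag))
        return True
    except sqlite3.IntegrityError:
        return False  # bad batch rejected, earlier data untouched


accepted = add_tags(conn, 1, ["ISBN:1234567890"])
rejected = add_tags(conn, 2, ["good tag", ""])  # second tag violates the CHECK
row_count = conn.execute("SELECT COUNT(*) FROM tags").fetchone()[0]
```

    The failed batch leaves no partial rows behind, and the connection keeps serving other requests.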

  • AndyC

    @magicalclick: To some extent it depends what it means to the end user in the failure scenario. You mention, for example, a link to a review on Amazon and what happens if Amazon itself disappears. In that case, what would the alternative behaviour be? Would the user expect to still see the contents of the old review? Would they expect to see a review from elsewhere? How big an issue will it be if the loss of access to Amazon is merely a transitory fault?

    W3bbo makes a great point about the problems of storing redundant data and then trying to keep it in sync, however in the specific example what is actually likely to change? The title of the book probably won't, presumably neither will other details like author, ISBN etc. If you can gain needed resilience by also storing these and not worrying too much about them staying in sync, will you improve the experience for the end user in a failure situation?

  • magicalclick

    @AndyC:

    Hello, when you include book name, ISBN, and author, it is already redundant. In the minimal tag, all you need is the ISBN. You can get the book name and author by searching for the ISBN in other domains. spivonious did solve the problem with a GID such as ISBN (if that's international), but not everything has a GID.

    If you are unsure, the way to get to a book review is:

    My review is in reviews.com and its tag = "Biography.com ISBN". That's the only tag value needed.

    • Search US in Country.com, drill down to Presidents.com, drill down to Obama.com, drill down to Biography.com, drill down to reviews.com.

    You can also search using this path:

    • Search Brown in Color.com, drill down to SkinColorPeople.com, drill down to Obama.com, drill down to Biography.com, drill down to reviews.com.

    As you can see, my review can be reached from US President or Brown Skin Color without me including those facts. A minimal tag only requires a single link; everything else can be inferred. This is very powerful when you can really crowd source all these separate domains instead of relying on search engines reading tag spam and building relationships internally.
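
    The two drill-down paths above can be sketched as a breadth-first search over links between domains. The `LINKS` table is a made-up stand-in for the cross-domain queries; only the domain names come from the paths above:

```python
from collections import deque
from typing import Optional

# Each domain only knows the next hop; no domain stores the whole path.
LINKS = {
    "Country.com": ["Presidents.com"],
    "Presidents.com": ["Obama.com"],
    "Color.com": ["SkinColorPeople.com"],
    "SkinColorPeople.com": ["Obama.com"],
    "Obama.com": ["Biography.com"],
    "Biography.com": ["reviews.com"],  # the single tag on my review
}


def find_path(start: str, goal: str, links: dict[str, list[str]]) -> Optional[list[str]]:
    """Breadth-first search for a chain of domains from start to goal."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in links.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


from_country = find_path("Country.com", "reviews.com", LINKS)
from_color = find_path("Color.com", "reviews.com", LINKS)

# If the one domain my tag depends on disappears, both paths break at once.
without_biography = {k: v for k, v in LINKS.items() if k != "Biography.com"}
broken = find_path("Country.com", "reviews.com", without_biography)
```

    The last lookup returning nothing is the downside from my first post: one link is enough to be found, and losing that one link is enough to be lost.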

  • AndyC

    Indeed, although from what I remember (and thus entirely feasibly wrong) an ISBN is specific to the exact version of a book. So if the hardback goes out of print and Amazon were to show your review on the otherwise identical paperback, you've potentially lost the trail once again, because your GID no longer features in the results.

    And obviously if you expand the system out from merely books to something else, you once again risk venturing into something that no longer fits quite so easily into such a minimalist model, because it doesn't come with some ready-made unique ID. It's a generic risk/reward question at the end of the day: what are you prepared to risk by storing so little about something you already know a lot about (instead relying on third parties), and is the trade-off financially worth it?

    Sometimes it's very worth it: trying to replicate the dynamic and complex data around an individual's friendships is probably vastly more complicated than putting your faith in Facebook to supply it for you. On the other hand, remembering that an ISBN matches a specific author/title? Not sure the savings outweigh the risks quite so much.
