I'm still using SharpReader, but it's supremely annoying to discover blogs you didn't know existed, full of interesting posts, that aren't exposed in a way that lets them show up in SharpReader. And SharpReader has problems of its own: its Atom feed support is not as good as its RSS support, and sometimes a blog has comments that don't show up in SharpReader. And if there are images in a blog post, I doubt they'll still be available if the blog goes down before I have time to go through the posts.
Maybe there is a better way of achieving the same goal. If so, I'm open to suggestions, but here's what I have in mind now:
One proposed solution:
1) An engine with scraping plugins/modules/whatever that are easy to add or modify.
2) A server and storage/db that exposes the scraped content through RSS/NNTP/HTML/EPUB. So even if the blog goes down, the blog, including images and comments, will still be viewable.
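To make item 1 a bit more concrete, here is a minimal sketch in Python of what "easy to add" plugins could look like: one function per site, registered by hostname. Everything here is hypothetical and only for illustration (the `ScrapedPost` fields, the `scraper_for` decorator, the `blog.example.com` host), and the title extraction is a toy; a real plugin would use a proper HTML parser and also collect comments and images.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List
from urllib.parse import urlparse


@dataclass
class ScrapedPost:
    """What a plugin hands back to the engine (hypothetical shape)."""
    url: str
    title: str
    body_html: str
    comments: List[str] = field(default_factory=list)
    image_urls: List[str] = field(default_factory=list)


Scraper = Callable[[str, str], ScrapedPost]

# Registry of plugins, keyed by the hostname they know how to scrape.
_SCRAPERS: Dict[str, Scraper] = {}


def scraper_for(hostname: str) -> Callable[[Scraper], Scraper]:
    """Decorator that registers a scraping plugin for one hostname."""
    def register(func: Scraper) -> Scraper:
        _SCRAPERS[hostname] = func
        return func
    return register


def scrape(url: str, html: str) -> ScrapedPost:
    """Dispatch already-downloaded HTML to the plugin for the URL's host."""
    host = urlparse(url).hostname or ""
    if host not in _SCRAPERS:
        raise LookupError(f"no scraping plugin registered for {host!r}")
    return _SCRAPERS[host](url, html)


@scraper_for("blog.example.com")  # made-up host, for illustration only
def scrape_example_blog(url: str, html: str) -> ScrapedPost:
    # Toy extraction: take whatever sits between <title> and </title>.
    start = html.find("<title>")
    end = html.find("</title>")
    title = html[start + len("<title>"):end] if 0 <= start < end else url
    return ScrapedPost(url=url, title=title, body_html=html)
```

Adding support for another blog would then just mean writing one more decorated function; the engine itself never changes.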
I would imagine someone has already written all this in the ~decade that RSS has been around. Since NNTP/news clients are probably the best way to read blogs and forums, thanks to their standard interface and a design better suited to storing embedded images and large numbers of posts, NNTP would probably take priority over the other formats. (SharpReader consumes 3 GB of memory when I open it, because it loads all the XML of every blog post.)
Ideally there would also be a plugin to convert any forum with threads and posts into news reader format.
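As a rough sketch of what such a forum plugin would have to produce: a news reader rebuilds the thread tree from each article's Message-ID and References headers, so the conversion mostly comes down to generating those consistently. The Python below is only an illustration under my own assumptions: the `ForumPost` shape, the `<post_id@forum_host>` Message-ID scheme, and the placeholder From address are all made up, and nothing here talks to a real NNTP server.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from email.message import EmailMessage
from email.utils import format_datetime
from typing import List, Optional


@dataclass
class ForumPost:
    """One scraped forum post (hypothetical shape)."""
    post_id: str
    author: str
    text: str
    posted_at: datetime
    reply_to: Optional[str] = None  # post_id of the post being answered


def message_id(forum_host: str, post_id: str) -> str:
    # Assumed scheme: derive a stable Message-ID from the forum's own post id.
    return f"<{post_id}@{forum_host}>"


def thread_to_articles(forum_host: str, newsgroup: str, subject: str,
                       posts: List[ForumPost]) -> List[EmailMessage]:
    """Turn one forum thread into news-style articles, one per post."""
    by_id = {p.post_id: p for p in posts}
    articles = []
    for post in posts:
        # Walk up the reply chain to collect ancestors for References.
        chain = []
        seen = {post.post_id}  # guards against accidental reply cycles
        parent = post.reply_to
        while parent in by_id and parent not in seen:
            seen.add(parent)
            chain.append(message_id(forum_host, parent))
            parent = by_id[parent].reply_to
        chain.reverse()  # References lists the oldest ancestor first

        article = EmailMessage()
        # Placeholder address: forums rarely expose real e-mail addresses.
        article["From"] = f"{post.author} <nobody@{forum_host}>"
        article["Newsgroups"] = newsgroup
        article["Subject"] = subject if post.reply_to is None else f"Re: {subject}"
        article["Date"] = format_datetime(post.posted_at)
        article["Message-ID"] = message_id(forum_host, post.post_id)
        if chain:
            article["References"] = " ".join(chain)
        article.set_content(post.text)
        articles.append(article)
    return articles
```

Flat forums without explicit reply links would need a fallback, for example treating every post as a reply to the first post of the thread.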
I did in fact find an RSS-to-NNTP bridge written in Java (nntp//rss); however, that's only half the solution.
Summary: Blogs and forums come and go. To preserve that wealth of information and history, to make it more convenient to use, and to allow latency-free skimming through posts and comments, I think this kind of thing would be nice. A possible additional improvement would be a distributed storage backend, such that everyone using the engine makes their scraped blogs available to everyone else using the engine. This could be achieved by hashing the scraped contents and using public usenet servers as storage, in addition to local storage for offline reading. E.g. let's say I am on another planet. It's not practical to expect an internet connection there, and the latency would be far too high. So the only practical solution is to take a copy of the entire internet (blogs, forums, newsgroups and MSDN docs will do) when you leave orbit.
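The hashing part could look something like the Python sketch below: content is stored under its own hash, so identical copies scraped by different people collapse into one entry, and anything fetched from an untrusted shared store can be checked against its key. SHA-256 and the `ContentStore` class are my own assumptions for illustration, and the in-memory dict stands in for both local disk and whatever shared backend would actually be used.

```python
import hashlib
from typing import Dict


class ContentStore:
    """Toy content-addressed store: the key is the hash of the content."""

    def __init__(self) -> None:
        # Stand-in for local disk or a shared backend (assumption).
        self._blobs: Dict[str, bytes] = {}

    def put(self, content: bytes) -> str:
        """Store content and return its key; duplicates are stored once."""
        key = hashlib.sha256(content).hexdigest()
        self._blobs.setdefault(key, content)
        return key

    def get(self, key: str) -> bytes:
        """Fetch content and verify that it still matches its key."""
        content = self._blobs[key]
        if hashlib.sha256(content).hexdigest() != key:
            raise ValueError(f"stored content does not match key {key}")
        return content

    def __len__(self) -> int:
        return len(self._blobs)
```

A scraped post would then only need to record the keys of its images and body, and any peer holding the same bytes could serve them.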
edit: There are some other things to keep in mind. If everyone used this engine and it had distributed shared storage, then people reading the blog posts would not actually hit the blog (or they would hit it on a background thread when they were online, just to ping, refresh and merge updates). The blog author might want to know they are being read, and might also want to host ads. So if a distributed content archive was included, there should be a way to let the blog author know their post was read, and also to expose some of the ads through the served RSS/NNTP/HTML/EPUB (e.g. in a separate column running vertically alongside the text).
It'd be tempting to also allow for some visual personalization, at least when reading through HTML/EPUB. However, because skimming through posts quickly would be mighty annoying if the background and text color changed all the time, I would at most allow the blog author to expose some custom image tags, and each reader could then decide whether they want to see the typical elements of a book: a background texture (e.g. imitation paper) and some kind of border images (to imitate an old book where the text was surrounded by fancy graphics near the margins).
Also tempting would be allowing an interactive scripting tag, on the assumption that the script runs offline and is completely sandboxed. This would allow interactive content on the scraped page, but not for the purpose of tracking/ads, and it could be toggled on or off per blog, per post, or globally for the HTML version.
edit2: Now that I've thought about this, it would make the most sense if bloggers just published their blogs as (optionally interactive) books that do not assume internet connectivity. That would solve almost all the issues related to blogs, but not those related to forums. So perhaps the focus should be on "forums over NNTP" instead.