Tech Off Thread

41 posts

Is indexing for Desktop Search, OverKill?

  • SpectateSwamp

     

    All my textual data from the last 10 years is just 70 megabytes.
    A non-indexing application can search that in just over 4 seconds.

    The net search companies have trotted out their indexed searches,
    which are complex database monstrosities designed for millions
    of times that amount of data.

    A lifetime's reading can fit on a DVD (4.7 GB).
    It would be silly to archive and search info you have never read.
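
A non-indexing search of the kind described here is essentially one sequential pass over the file. A minimal Python sketch, assuming a plain-text archive; the function name `linear_search` and the demonstration data are hypothetical, and this is not the poster's application:

```python
import os
import tempfile


def linear_search(path, needle):
    """Scan a text file once and return (line_number, line) for each match."""
    needle = needle.lower()
    hits = []
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        for number, line in enumerate(handle, start=1):
            # Case-insensitive substring test on every line; no index is consulted.
            if needle in line.lower():
                hits.append((number, line.rstrip("\n")))
    return hits


# Small demonstration file standing in for a personal text archive.
demo_path = os.path.join(tempfile.mkdtemp(), "alltext.txt")
with open(demo_path, "w", encoding="utf-8") as handle:
    handle.write("notes on desktop search\nshopping list\nSearch tools compared\n")

results = linear_search(demo_path, "search")
print(results)
```

The cost of this approach grows in direct proportion to the amount of text, which is the trade-off the rest of the thread argues about.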

  • Gaijin

    For home users, yes. But imagine a corporation where all computers are networked and share information with each other and with a database. Depending on the size and function of the corporation, the data that amasses over time is far greater than a lifetime's reading. That would warrant an advanced search algorithm, like indexing...
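
For readers unfamiliar with the term, indexing in this sense usually means building an inverted index: a map from each word to the documents that contain it, so that a query becomes a lookup rather than a scan. A toy sketch in Python; the names `build_index` and `lookup` and the sample documents are made up for illustration, and real desktop search engines are considerably more involved:

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each lowercase word to the set of document names containing it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(name)
    return index


def lookup(index, query):
    """Return the sorted names of documents that contain every query word."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return []
    # Intersect the posting sets of all query words.
    matches = set.intersection(*(index.get(word, set()) for word in words))
    return sorted(matches)


# Hypothetical documents, not taken from the thread.
documents = {
    "memo.txt": "Quarterly sales report for the paper mill",
    "notes.txt": "Meeting notes about the sales database",
    "todo.txt": "Buy paper and toner",
}
index = build_index(documents)
print(lookup(index, "sales"))
print(lookup(index, "paper mill"))
```

Building the index costs a full pass over the data up front; the payoff is that later queries do not have to touch the documents at all.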

  • blowdart

    Please don't feed the troll/idiot who posts the same thing over and over again, all over the web.

  • Yggdrasil

    SpectateSwamp wrote:
    All my textual data from the last 10 years is just 70 megabytes.
    A non-indexing application can search that in just over 4 seconds.

    The net search companies have trotted out their indexed searches,
    which are complex database monstrosities designed for millions
    of times that amount of data.

    A lifetime's reading can fit on a DVD (4.7 GB).
    It would be silly to archive and search info you have never read.



    Are we going for the bait'n'switch now? Because some of us do remember that you're always around pimping your desktop search application. 

    Trying to make it look like you're actually trying to have a conversation that just happens to slam your business rivals is a tiny bit suspicious.

    That being said, though, I'll try to hijack the thread for legitimate purposes:

    My current My Documents folder contains about 3.5 GB of data. Much of that is pictures, but that is still searchable data - a good desktop search engine will look in EXIF data on a picture file, use the file and directory name as context, and find me my pictures from the Mercury Rev show even if the file is named IMG_5611.JPG.

    I also have several hundred megabytes of PDF ebooks I downloaded from MSPress, O'Reilly and anyone else who offered free ebooks on technical subjects. These can fill up a DVD really quickly.

    Newer formats have more and more data in them - Word documents, spreadsheets and e-books are formats that contain a lot of non-searchable data or are slower to parse on-the-fly. Again, indexing to the rescue.

  • SpectateSwamp

    Yggdrasil wrote:
    My current My Documents folder contains about 3.5 GB of data. Much of that is pictures, but that is still searchable data - a good desktop search engine will look in EXIF data on a picture file, use the file and directory name as context, and find me my pictures from the Mercury Rev show even if the file is named IMG_5611.JPG.

    I have 5000+ family pictures that are searchable.
    We created the metadata in a text file. What do
    you use to embed the metadata in the picture file?

    A good desktop search will randomly show you your
    pictures at whatever speed you wish.

    Yggdrasil wrote:
    I also have several hundred megabytes of PDF ebooks I downloaded from MSPress, O'Reilly and anyone else who offered free ebooks on technical subjects. These can fill up a DVD really quickly.

    I shy away from most formats. Keeping screen captures if I'm
    interested. Capturing any text that I can.
    I might even video in these PDF ebooks for backup and playback??

    Yggdrasil wrote:
    Newer formats have more and more data in them - Word documents, spreadsheets and e-books are formats that contain a lot of non-searchable data or are slower to parse on-the-fly. Again, indexing to the rescue.


    Newer formats. What about large font scrolling text, from your
    favourite files?

  • Lee_Dale

    SpectateSwamp wrote:
    I shy away from most formats. Keeping screen captures if I'm
    interested. Capturing any text that I can.
    I might even video in these PDF ebooks for backup and playback??



    Simply LOFL!

  • littleguru

    I find it to be overkill too. I'm not a guy who scatters data all around the disks. I have it sorted in folders and that is fine for me. That's also why the search service is disabled in my Vista installation.

  • Yggdrasil

    SpectateSwamp wrote:
    A good desktop search will randomly show you your
    pictures at whatever speed you wish.


    Umm, no. A good desktop search will find the picture I want.
    A good image viewer will let me run a slideshow of my pics. I'm partial to IrfanView, myself.


    SpectateSwamp wrote:
    I shy away from most formats. Keeping screen captures if I'm
    interested. Capturing any text that I can.
    I might even video in these PDF ebooks for backup and playback??


    And instead of keeping a DVD library, I find it much easier to translate them all to flip-books of stick-figures.

    SpectateSwamp wrote:
    Newer formats. What about large font scrolling text, from your
    favourite files?


    You're not even going to answer me, are you? You'll just spout off random so-called features from your sales brochure?

    Sheesh.

  • rjdohnert

    Couldn't get attention in the Coffeehouse, so now you move to Tech Off? What a dweeb. Don't feed the trolls.

  • SpectateSwamp

    rjdohnert wrote:
    Couldn't get attention in the Coffeehouse, so now you move to Tech Off? What a dweeb. Don't feed the trolls.


    Simple text files process very fast.
    After the first read, they probably reside in memory
    for even faster searches. As the disks and computers get
    faster, so will the search results.

    I was thrilled with the speed of text search in the mid-1980s.
    With 50+ users on a much slower machine, I'd do my searches
    and it was still very fast. Now it's just me and my PC.

    That primitive (DEC/VAX) text search was fast enough to search a
    pulp and paper mill's transaction data.

    This one has searched telephone exchange toll data files.

    It could be years before a new user would accumulate enough text
    to have to worry about access speed. Maybe never.

    I'm just saying that these database / indexers are overkill and
    problematic. Is that trolling?

     

  • Sven Groot

    SpectateSwamp wrote:
    It could be years before a new user would accumulate enough text to have to worry about access speed. Maybe never.

    The problem is that you seem to assume all of this information is in a single file. Doing a full text search on my Visual C++ include directory (which I sometimes do to find the value of a constant or something) takes about a minute, even though it's only 3MB or so. It is however in excess of 200 files, so seek times kill the performance.

    Most people (i.e. everyone who is not you, it seems) do not want to copy/paste hundreds of documents and mails into a single file just so your app can search them. In a sense that is indexing, just manually instead of automatically. This is especially true in cases where you want to preserve rich formatting.

    And computers are not always fast enough. I remember a friend of mine was doing a search and replace in a plain text file that contained 20 years of intra-day stock exchange index values for five different indices. This file was somewhere over 100MB (I forget how long exactly). The search and replace operation took more than 10 minutes.

    And on a constrained platform the problem is even larger. I have written a Japanese dictionary for the PocketPC. The dictionary file is only about 6MB (I'm using EDICT), but even on my relatively powerful Axim X51v (600MHz CPU) doing a linear search through that file takes on the order of 20 to 30 seconds, far more than the amount of time I'm willing to wait for a simple word lookup. With the index file I can reduce the search to about 2 seconds.
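
One common way to build an index of this kind is to record, for each headword, the byte offset of its entry and keep the list sorted, so a lookup becomes a binary search plus a single seek. The sketch below assumes a simplified one-entry-per-line format with the headword before the first space; it is not the real EDICT layout or the PocketPC implementation described here, `build_offsets` and `find_entry` are hypothetical names, and for brevity the index is kept in memory rather than written to an index file:

```python
import bisect
import os
import tempfile


def build_offsets(path):
    """Return a sorted list of (headword, byte_offset) pairs for a dictionary file."""
    offsets = []
    with open(path, "rb") as handle:
        position = 0
        for raw in handle:
            headword = raw.split(b" ", 1)[0].decode("utf-8")
            offsets.append((headword, position))
            position += len(raw)
    offsets.sort()
    return offsets


def find_entry(path, offsets, word):
    """Binary-search the index, then seek straight to the entry on disk."""
    # (word, -1) sorts before any real (word, offset) pair, so this finds the first match.
    slot = bisect.bisect_left(offsets, (word, -1))
    if slot == len(offsets) or offsets[slot][0] != word:
        return None
    with open(path, "rb") as handle:
        handle.seek(offsets[slot][1])
        return handle.readline().decode("utf-8").rstrip("\n")


# Tiny made-up dictionary in the assumed format.
demo_path = os.path.join(tempfile.mkdtemp(), "dict.txt")
with open(demo_path, "w", encoding="utf-8", newline="\n") as handle:
    handle.write("neko /cat/\ninu /dog/\ntori /bird/\n")

offsets = build_offsets(demo_path)
print(find_entry(demo_path, offsets, "inu"))
```

The lookup reads only one line of the dictionary file regardless of its size, which is where the saving over a linear scan comes from.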

    SpectateSwamp wrote:
    I'm just saying that these database / indexers are overkill and problematic. Is that trolling?

    Not by itself, no. In fact, there's definitely an element of truth there. But you're not "just" saying that. You're saying it over and over again, and you're saying it with the intention of advertising your own search solution, and you keep mentioning random features of your product while saying it. All this despite the fact that it has been made very clear, repeatedly, that we are not interested in your product and in fact think it's very bad. That constitutes trolling, or at the very least spamming.

  • SpectateSwamp


    Sven Groot wrote:
    The problem is that you seem to assume all of this information is in a single file. Doing a full text search on my Visual C++ include directory (which I sometimes do to find the value of a constant or something) takes about a minute, even though it's only 3MB or so. It is however in excess of 200 files, so seek times kill the performance.

     


    When searching my email, it's either the inmail or outmail
    files I search. If I have a thousand files, first I do a quick merge of
    them, then search the resulting file. No directory overhead here.
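
A merge step like the one described here can be sketched in a few lines of Python. This is an assumption about what such a merge might look like, not the poster's tool; `merge_text_files` is a hypothetical helper, and the header line written before each file is an added convenience so hits can still be traced to their source:

```python
import os
import tempfile


def merge_text_files(source_dir, merged_path):
    """Concatenate every .txt file in source_dir into one file; return the file count."""
    names = sorted(n for n in os.listdir(source_dir) if n.endswith(".txt"))
    with open(merged_path, "w", encoding="utf-8") as out:
        for name in names:
            out.write(f"===== {name} =====\n")
            source = os.path.join(source_dir, name)
            with open(source, encoding="utf-8", errors="replace") as src:
                for line in src:
                    # Make sure a file without a trailing newline does not run into the next header.
                    out.write(line if line.endswith("\n") else line + "\n")
    return len(names)


# Two made-up message files standing in for a folder of emails.
source_dir = tempfile.mkdtemp()
for name, body in [("a.txt", "first message\n"), ("b.txt", "second message")]:
    with open(os.path.join(source_dir, name), "w", encoding="utf-8") as handle:
        handle.write(body)

merged_path = os.path.join(tempfile.mkdtemp(), "merged.txt")
count = merge_text_files(source_dir, merged_path)
with open(merged_path, encoding="utf-8") as handle:
    merged_text = handle.read()
print(count)
print(merged_text)
```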

    I used it to search Visual Basic frm and cls code. Extremely fast
    and all in context. Indexed search isn't about context.

    Sven Groot wrote:

    Most people (i.e. everyone who is not you, it seems) do not want to copy/paste hundreds of documents and mails into a single file just so your app can search them. In a sense that is indexing, just manually instead of automatically. This is especially true in cases where you want to preserve rich formatting.


    Why on earth keep the rich text formatting at the expense of
    portability? Few if any reports / documents are ever printed or
    sent a second time.

    I wish I had a quick utility to dump my emails to a text file.
    Right now it takes me 6 or 7 seconds to forward, cut and
    paste that to my text file. Then delete the email.

    Sven Groot wrote:
    And computers are not always fast enough. I remember a friend of mine was doing a search and replace in a plain text file that contained 20 years of intra-day stock exchange index values for five different indices. This file was somewhere over 100MB (I forget how long exactly). The search and replace operation took more than 10 minutes.

    Most computers are faster than mine.

    A search and replace with this app on a 4 year old laptop would
    probably take less than 30 seconds. It can read 100MB in 6 seconds.
    The string search would maybe add 10 seconds. Then the write.

    I'll test it out on a merge of some of my email files.

    Sven Groot wrote:

    And on a constrained platform the problem is even larger. I have written a Japanese dictionary for the PocketPC. The dictionary file is only about 6MB (I'm using EDICT), but even on my relatively powerful Axim X51v (600MHz CPU) doing a linear search through that file takes on the order of 20 to 30 seconds, far more than the amount of time I'm willing to wait for a simple word lookup. With the index file I can reduce the search to about 2 seconds.

     


    6MB would take 1/3 second, start to finish. With some of the hits
    being at the beginning, any display delay is almost unnoticed.
    If I were you, I'd take along a CD with Real search and your 6MB
    data, then borrow a computer. Simple word lookup is always about
    the context, and indexers do a pathetic job of displaying that.

     

  • SpectateSwamp

    Completed a test search and replace on my alltext.txt file (70MB).
    This is all the text I'm ever interested in searching. It took 6 minutes
    to do a search and replace for that. Had I used an external drive for
    the output file, it would probably be done in less than half that time.

    So I was a little overoptimistic about the search & replace speed.

    The search and replace option is part of the encryption process.
    When you start dragging around really huge files you need a good
    search and replace. This has it. Indexers don't.
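
Putting the figures quoted in this thread side by side shows the size of the gap. The numbers below are taken from the posts (an estimate of under 30 seconds, assumed here to refer to the 100 MB file under discussion, and a measured 6 minutes for 70 MB); the per-megabyte comparison is a back-of-the-envelope calculation, not a benchmark:

```python
# Figures quoted in the thread.
estimated_seconds = 30      # "less than 30 seconds", assumed to be for the 100 MB file
estimated_size_mb = 100
measured_seconds = 6 * 60   # 6 minutes for alltext.txt
measured_size_mb = 70

estimated_rate = estimated_seconds / estimated_size_mb   # seconds per MB
measured_rate = measured_seconds / measured_size_mb      # seconds per MB
slowdown = measured_rate / estimated_rate

print(f"estimated: {estimated_rate:.2f} s/MB")
print(f"measured:  {measured_rate:.2f} s/MB")
print(f"measured run is about {slowdown:.0f}x slower per MB than estimated")
```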

     

  • rjdohnert

    I just searched for your name SpectateSwamp, yep no one is interested.

  • kriskdf

    A find and replace should not take that long for a file that size.  You could write an algorithm to do it in far less time than that.  In fact, the Visual Studio find and replace is very quick.  It is orders of magnitude faster than find and replace in Notepad.

    Now, I don't work on Visual Studio so I don't know how they implemented it, but to make it work fast you would need to use something similar to indexing. :)

    And to answer your post's question: NO, indexing is the right tool for the job of desktop search and not overkill. A customer does not want to do anything other than type some keywords in a search box to find what they are looking for as fast as possible.

    I suppose you would find file cabinets and folders overkill for your home office too.

  • Sven Groot

    kriskdf wrote:
    A find and replace should not take that long for a file that size.

    My friend was doing it in Word I believe, so you can blame them. ;)

    kriskdf wrote:
    You could write an algorithm to do it in far less time than that.  In fact, the Visual Studio find and replace is very quick.  It is orders of magnitude faster than find and replace in Notepad.

    And in VS2005 if you use regular expressions, it can crash the IDE orders of magnitude faster than anything else too. So bad I even wrote a replacement for it.

  • blowdart

    rjdohnert wrote:
    I just searched for your name SpectateSwamp, yep no one is interested.


    His crud regularly gets culled from Wikipedia as well.

    Even more amusing is a search via Google; apparently he spouts nonsense in video formats in much the same way he does here, where anything pointing out how wrong he is gets ignored, and then a new thread is started with the same repeated crud a couple of weeks later.

  • SpectateSwamp

    kriskdf wrote:
    A find and replace should not take that long for a file that size.  You could write an algorithm to do it in far less time than that.  In fact, the Visual Studio find and replace is very quick.  It is orders of magnitude faster than find and replace in Notepad.



    OK. Do one and tell me how long it takes.
    Mine reads a line at a time, checks it and writes
    out the line with changes to an output file.
    Sequential reads and writes are about as fast
    as you will get. Note: I seldom use the search and
    replace.
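
The line-at-a-time approach described here looks roughly like the following in Python. It is a sketch of the general technique, not the poster's code; `replace_in_file` and the file names are hypothetical. Note that a match spanning a line break would be missed, which is inherent to processing one line at a time.

```python
import os
import tempfile


def replace_in_file(in_path, out_path, old, new):
    """Copy in_path to out_path one line at a time, replacing old with new.

    Returns the number of replacements made.
    """
    if not old:
        raise ValueError("search string must not be empty")
    replaced = 0
    with open(in_path, encoding="utf-8", errors="replace") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            # Only one line is held in memory at any time.
            replaced += line.count(old)
            dst.write(line.replace(old, new))
    return replaced


# Made-up input standing in for a large text file.
work_dir = tempfile.mkdtemp()
in_path = os.path.join(work_dir, "outmail.txt")
out_path = os.path.join(work_dir, "outmail.new.txt")
with open(in_path, "w", encoding="utf-8") as handle:
    handle.write("foo bar\nbar baz bar\n")

count = replace_in_file(in_path, out_path, "bar", "qux")
with open(out_path, encoding="utf-8") as handle:
    output_text = handle.read()
print(count, repr(output_text))
```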

    The largest file I search is my outmail file, which is 30MB.
    It is different from most files I search, in that there are
    a multitude of copies of my resume there. I search this
    file a little differently from the others for that reason.

    This search always produces results in context.
    (unless you specify matching lines only)

    In a simple search for 10 occurrences throughout a
    70MB file, this one will show you the data in
    context much much faster than any indexing search.

    No context = no good
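
Results in context means showing the lines around each hit, much as `grep -C` does. A minimal Python sketch of that idea; `search_with_context` is a hypothetical name and the sample lines are made up:

```python
def search_with_context(lines, needle, context=1):
    """Return, for each matching line, the block of lines within `context` lines of it."""
    blocks = []
    for i, line in enumerate(lines):
        if needle in line:
            start = max(0, i - context)
            end = min(len(lines), i + context + 1)
            blocks.append(lines[start:end])
    return blocks


lines = [
    "Subject: resume",
    "Please find my resume attached.",
    "Regards",
    "Subject: lunch",
    "Noon works for me.",
]
for block in search_with_context(lines, "attached"):
    print("\n".join(block))
```

With `context=0` the function returns matching lines only, which corresponds to the "matching lines only" option mentioned above.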

    It takes years for people to accumulate 100MB of text
    from email, notes and stuff copied from the net. For that
    amount of data, yes, indexing is OverKill!!!!!!

