page 1 of 1
Comments: 11 | Views: 5764
manickernel
anticipate consequences..
I am hoping you C9 gurus can help me here. I am a sysadmin for a small county government. The issue is that we need to search through a large dump of PDF files of archived issues of local newspapers for our library: essentially "google" the back issues for relevant terms. I am looking at WSS with the Adobe IFilter as one possibility, but wanted to know what other options are available. This will probably expand to other document formats in the future.

Essentially this does not seem much different from current "desktop search", but obviously on a larger scale and backed by an indexed database.

On a side note: while the technology for digitizing documents and making them widely available has been around for some time (we use Legato and some custom integrated stuff in some departments), it seems that the demand for this to be available publicly has taken off exponentially in government now.

So what's going on out there? What's good? I am not averse to using a Linux/OSS or Windows solution. I just like what works.  
Nata1
.Search - Google Appliance killer
You can get about 15 million docs into about 100 gigs of disk space, and we have something new (using Lucene) that needs about two-thirds of that.

The only drawback is you'll have to write the PDF filter yourself, but .Search is free and you get lots more control over ranking configuration, security, rating, commenting, and actually getting to see what the index processed, which is something the Google appliance doesn't give you.

And the source is free too.

You could also use Nata1.2, which has been around for a couple of years, but again you'd have to write your own PDF filter.
blowdart
Peek-a-boo
No one loves good old Index Server any more. Which is a shame, because it works well. Dedicate a whole machine to indexing and just leave it running.
Nata1
.Search - Google Appliance killer
If the content is static, then blowdart is right.

Also, you can use Nata1.2 for the UI elements (it lets you switch between the Google web service, the Nata indexer, or Index Server).

I have all the code in 1.2 to get you going.
Maurits
AKA Matthew van Eerde
If it were me, I'd dig around for a PDF text-extraction utility that I could automate. Then I'd set up a tool to iterate over all the PDFs and dump content and metadata (file name, creation timestamp, etc.) into a SQL database. I'd turn on full-text indexing on the content column.

Then I'd configure the tool to periodically dump "new" content and metadata, and make that a SQL Agent job.

But as with all things, the answer will depend on your personal expertise and the tools already available to you.
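
A minimal sketch of that pipeline, under several assumptions that are not from the thread: it uses Python with SQLite's FTS5 extension as a stand-in for a SQL Server full-text index, it shells out to Poppler's `pdftotext` as the extraction utility, and the table and function names (`docs`, `doc_text`, `open_index`, `index_folder`, `search`) are made up for illustration. It only handles PDFs that already carry a text layer; scanned pages would need OCR first.

```python
import os
import sqlite3
import subprocess
from datetime import datetime, timezone


def pdftotext_extract(path):
    """Extract text with the `pdftotext` CLI (Poppler); '-' writes to stdout."""
    result = subprocess.run(
        ["pdftotext", path, "-"], capture_output=True, text=True, check=True
    )
    return result.stdout


def open_index(db_path):
    """Open (or create) the database. Assumes SQLite was built with FTS5."""
    conn = sqlite3.connect(db_path)
    # Metadata table; mtime lets later runs skip files that have not changed.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS docs ("
        " id INTEGER PRIMARY KEY, path TEXT UNIQUE, mtime REAL, indexed_at TEXT)"
    )
    # Full-text table; its rowid is kept equal to docs.id.
    conn.execute("CREATE VIRTUAL TABLE IF NOT EXISTS doc_text USING fts5(content)")
    return conn


def index_folder(conn, root, extract=pdftotext_extract):
    """Index new or modified PDFs under root; return how many were (re)indexed."""
    count = 0
    for folder, _dirs, files in os.walk(root):
        for name in files:
            if not name.lower().endswith(".pdf"):
                continue
            path = os.path.join(folder, name)
            mtime = os.path.getmtime(path)
            row = conn.execute(
                "SELECT id, mtime FROM docs WHERE path = ?", (path,)
            ).fetchone()
            if row and row[1] == mtime:
                continue  # unchanged since the last run
            text = extract(path)
            now = datetime.now(timezone.utc).isoformat()
            if row:
                doc_id = row[0]
                conn.execute(
                    "UPDATE docs SET mtime = ?, indexed_at = ? WHERE id = ?",
                    (mtime, now, doc_id),
                )
                conn.execute("DELETE FROM doc_text WHERE rowid = ?", (doc_id,))
            else:
                cur = conn.execute(
                    "INSERT INTO docs (path, mtime, indexed_at) VALUES (?, ?, ?)",
                    (path, mtime, now),
                )
                doc_id = cur.lastrowid
            conn.execute(
                "INSERT INTO doc_text (rowid, content) VALUES (?, ?)", (doc_id, text)
            )
            count += 1
    conn.commit()
    return count


def search(conn, query):
    """Return paths of matching documents, best match first."""
    ids = [
        r[0]
        for r in conn.execute(
            "SELECT rowid FROM doc_text WHERE doc_text MATCH ? ORDER BY rank",
            (query,),
        )
    ]
    return [
        conn.execute("SELECT path FROM docs WHERE id = ?", (i,)).fetchone()[0]
        for i in ids
    ]
```

Running `index_folder` from a scheduled task (cron, Task Scheduler, or a SQL Agent job in the SQL Server version) would cover the periodic "new content" pass, since files whose modification time has not changed are skipped. The `extract` parameter is there so a different extraction tool, or an OCR step, can be swapped in without touching the rest.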
billh
call -141
Where are the PDF files stored? Is it something you have control over? Or are they provided by the newspaper company? Or are they kept on a library server somewhere? And does someone manually scan these pages in?
Maurits
AKA Matthew van Eerde
billh wrote:
And does someone manually scan these pages in?


Ooh good question... does OCR raise its ugly head...
Nata1
.Search - Google Appliance killer
Both of you guys are talking about solutions that would take a fair amount of development, at least a week of coding if everything went perfectly.

What he was looking for was packages that allow you to do this type of thing. There are a couple of solutions out there in .NET that let you index up to 1000 web pages.

There's nothing that lets you index 15 million documents or so. I'm not plugging my FOSS solutions (Nata or .Search), but it's just the way things are if you don't have the money for a Google Mini, and even there you can only go up to 100,000 docs and you get no ranking control.

Someone correct me if I'm wrong, but there's nothing out there that does this. Actually there is dtSearch, but it's pretty high priced; these guys have been around for a couple of years - http://www.dtsearch.com/

Blowdart is right that Index Server is pretty powerful, and if you need some more metadata you can use SQL full-text query, which actually uses Index Server. However, these options will take a week or more to set up, the latter could take you even more time, and you still have to create the presentation layer.

But, pitching my FOSS stuff: instead of writing a custom Index Server solution, just use Nata1.2 with Index Server as the search query provider, and if you need more power over ranking then use .Search.

Index Server has a limit of 2 billion documents in a single index; .Search has a limit of about 15 million at this time.
Nata1
.Search - Google Appliance killer
Manickernel, check out dtSearch desktop search: get the 30-day trial and index your PDFs once (or some sites).

http://www.dtsearch.com/download.html

I won't comment. It is a pretty neat search solution, but... there are other options that *may* be better :) and cheaper by about 1000 bucks.
If you have the money ($2,995), you could buy a Google Search Appliance server and dump it in the network...
Nata1
.Search - Google Appliance killer
You're talking about a Google Mini, which is 3 grand and is limited to 100,000 documents.

The Google appliance, which caps at about 15 million documents, is about 30 grand.

IMHO, Google is a great internet search solution because of its spam detection and relevancy, but Google doesn't give the developer any control over ranking, and Google doesn't give source code.

Also, to the OP, you might want to check out .Lucene. I trust that much more than I trust a Google Mini. Why waste your money and trap yourself?

But go out to Google and take one of the product demos so you can see how many configuration options you get (not many: you can specify a username/password for a directory, you can view a report of what people are searching for, and you can specify an XSLT sheet to override the canned UI).

I did all that two years ago and we're a lot further now. With Google you're paying for highly refined relevancy algorithms that apply to the internet, and for spam detection. Links in isn't anything new; .Search has it. With dtSearch and .Search you have much richer control over ranking and relevancy, except I give even more because I allow ranking modifiers not just for fuzzy searches but for stemming, distance, order, links in, word count, etc., plus you can code in as many things as you want.

I hate to irk anyone here, but Google Mini/Appliance is a joke: it's overpriced, there's no flexibility, there's no source code, it's a rip-off, it's just a name. I would trust .Lucene over that any day of the week. BTW, .Lucene is now running on www.asp.net, and because .Search is built on top of Community Server you can use that indexer if you want.
Microsoft Communities