Tech Off Thread

3 posts

Need to parse PDF Document inner text

  • Shararaaa

    Hi everybody,

    I need to parse PDF files and extract their inner text so that it can be indexed and searched. Could anyone help me find a component or some code that does this?

    Thanks.

  • PoulStaugaard

    I'm no expert, but this article should get you started: http://www.codeproject.com/csharp/MgPDFReader.asp

  • kpassaur

    I created a program that does this and can also rename the PDF using the text parsed from it.

    It was actually designed to OCR TIFF images and file them by parsed text. However, an engineering company had PDFs that they were going to convert to TIFF images so they could OCR them and parse the text. Rather than do it that way and risk OCR errors, I built in the ability to extract the PDF text directly and parse it. There are options to preserve formatting when parsing the extracted text or to pull the text out without formatting. I noticed that the CodeProject article mentioned above handles form fields; this utility pulls all text data, not just what is in the form fields. Its output is a standard CSV file. You can learn more about it at http://www.edocfile.com/filebyocr.htm
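Once the text is out of the PDF, the "indexed and searchable" part of the original question can be sketched roughly as below. This is only a minimal illustration in Python, not the code of either tool mentioned in this thread. It assumes the text has already been extracted by some other component; the names `normalize`, `build_index`, and `search`, and the sample file names and strings, are all hypothetical.

```python
import re
from collections import defaultdict


def normalize(text, preserve_formatting=False):
    """Return text unchanged, or with every whitespace run collapsed to one space."""
    if preserve_formatting:
        return text
    return " ".join(text.split())


def build_index(documents):
    """Map each lowercase word to the set of document names that contain it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for word in re.findall(r"\w+", normalize(text).lower()):
            index[word].add(name)
    return index


def search(index, query):
    """Return the names of documents that contain every word in the query."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Hypothetical sample text, standing in for the output of a PDF text extractor.
extracted = {
    "pump.pdf": "Pump  assembly\n drawing, revision B",
    "valve.pdf": "Valve assembly\tparts list",
}
index = build_index(extracted)
```

With this in place, `search(index, "pump drawing")` looks up each query word and intersects the matching document sets. A real search component would likely add stemming, ranking, and persistent storage, none of which is shown here.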
