Searching PDF with C#

Craig Suthers

I am trying to find a way to read the text out of a PDF file so that it can be searched. A client has asked me to quote on this requirement. My application is written in C#. I am having a lot of difficulty finding any examples, or even .NET components that I could purchase, to do this. The application will run on a server with very limited permissions, and I cannot install standard COM components. Hope someone can help. Enjoy, Craig

Matthew Hazlett

PDFs weren't really made for people to be able to rip the text contents. However, I did see something for .NET and PDFs that I "think" can read them. I am not sure, since I never tested it out; I just remember the link: http://sourceforge.net/projects/pdflibrary/ The project is now defunct, but maybe you can use some of the code. Hope this helps. Matthew Hazlett Windows 2000/2003 MCSE Never got an MCSD, go figure...

Craig Suthers

Thanks for your help. I looked into that product, which was abandoned at a very early stage. There is also itextsharp.sourceforge.net, which seems to be extensive in its PDF generation ability but specifically points out the same thing you did: that PDFs aren't made for ripping text. The solution I have found that works for me is that Adobe has a filter for Microsoft's Indexing Service that allows it to search through PDF files. See http://www.adobe.com/support/salesdocs/1043a.htm for more information. I can use Chris Maunder's article on using the Indexing Service as a search facility to code the rest. Enjoy, Craig
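For anyone following the same route, here is a minimal sketch of what the query side might look like once the Adobe filter is installed and the PDFs are in an indexed folder. It assumes the .NET Framework's System.Data.OleDb and the Indexing Service OLE DB provider (MSIDXS). The class name, the catalog name passed in, and the quoting rule in BuildQuery are illustrative assumptions, not something taken from the Adobe document or the article mentioned above.

```csharp
using System;
using System.Data.OleDb;

// Hypothetical helper: queries an Indexing Service catalog for documents
// whose contents match a phrase. PDFs only show up in the results if a PDF
// IFilter is installed and the catalog has indexed them.
static class PdfCatalogSearch
{
    // Builds the Indexing Service SQL text for a phrase search.
    // Assumption: doubling single quotes and dropping double quotes is
    // enough to keep the phrase inside the CONTAINS string literal.
    public static string BuildQuery(string phrase)
    {
        string safe = phrase.Replace("'", "''").Replace("\"", " ");
        return "SELECT Filename, Path, Rank FROM SCOPE() " +
               "WHERE CONTAINS(Contents, '\"" + safe + "\"') " +
               "ORDER BY Rank DESC";
    }

    // Runs the query against the named catalog and prints each hit.
    public static void Search(string catalog, string phrase)
    {
        string connectionString = "Provider=MSIDXS;Data Source=" + catalog;
        using (OleDbConnection conn = new OleDbConnection(connectionString))
        using (OleDbCommand cmd = new OleDbCommand(BuildQuery(phrase), conn))
        {
            conn.Open();
            using (OleDbDataReader reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                {
                    Console.WriteLine("{0} (rank {1})", reader["Path"], reader["Rank"]);
                }
            }
        }
    }
}
```

A call such as PdfCatalogSearch.Search("Web", "annual report") would list matching files from a catalog named "Web"; the catalog name is only an example and depends on how the server is configured.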

Heath Stewart

Actually, PDFs weren't made to stop "ripping of text", but to solve a common problem: providing a standard format for delivering rich content on the web (among other media). You can get the text because the text is available in PDFs unless the page is one giant image (which is rare). There are many ways to get the text. One easy way is to install Adobe PDF IFilter 5.0. IFilter is an interface that COM servers implement to facilitate searching of text. Office installs its own implementation, and Windows 2000 and higher have default IFilter implementations for searching text files, HTML documents, and several other common formats. While this would be easiest to use in C++, you can P/Invoke the necessary APIs and redeclare the interfaces so that you can use them in C#. There is an example that gets the IFilter for a .doc file (the system provides the right implementation, so you could easily replace the .doc filename with a .pdf filename) here: http://sqljunkies.com/weblog/acencini/posts/716.aspx.
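As a rough illustration of the P/Invoke approach described above (not the code from the linked post), the sketch below redeclares IFilter and calls LoadIFilter from query.dll. The interface GUID, method order, and status codes follow the Windows filter headers as I understand them; the class name, buffer size, and init flags are arbitrary choices. It only works on Windows with a filter registered for the file's extension.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Text;

[StructLayout(LayoutKind.Sequential)]
public struct PROPSPEC { public uint ulKind; public IntPtr data; } // union of PROPID / LPWSTR

[StructLayout(LayoutKind.Sequential)]
public struct FULLPROPSPEC { public Guid guidPropSet; public PROPSPEC psProperty; }

[StructLayout(LayoutKind.Sequential)]
public struct STAT_CHUNK
{
    public uint idChunk, breakType, flags, locale;
    public FULLPROPSPEC attribute;
    public uint idChunkSource, cwcStartSource, cwcLenSource;
}

// Redeclared COM interface. BindRegion, the last vtable slot, is left out
// because this sketch never calls it.
[ComImport, Guid("89BCB740-6119-101A-BCB7-00DD010655AF")]
[InterfaceType(ComInterfaceType.InterfaceIsIUnknown)]
public interface IFilter
{
    [PreserveSig] int Init(uint grfFlags, uint cAttributes, IntPtr aAttributes, out uint pdwFlags);
    [PreserveSig] int GetChunk(out STAT_CHUNK pStat);
    [PreserveSig] int GetText(ref uint pcwcBuffer, IntPtr awcBuffer);
    [PreserveSig] int GetValue(out IntPtr ppPropValue);
}

public static class FilterTextReader // hypothetical name
{
    const uint CHUNK_TEXT = 0x1;
    const int FILTER_S_LAST_TEXT = 0x00041709;
    // CANON_PARAGRAPHS | HARD_LINE_BREAKS | CANON_HYPHENS | CANON_SPACES
    const uint InitFlags = 0x1 | 0x2 | 0x4 | 0x8;
    const uint BufferChars = 4096; // arbitrary buffer size

    [DllImport("query.dll", CharSet = CharSet.Unicode)]
    static extern int LoadIFilter(string pwcsPath, IntPtr pUnkOuter,
        [MarshalAs(UnmanagedType.IUnknown)] out object ppIUnk);

    // Returns the text the registered filter exposes for the given file.
    public static string ExtractText(string path)
    {
        object unknown;
        Marshal.ThrowExceptionForHR(LoadIFilter(path, IntPtr.Zero, out unknown));
        IFilter filter = (IFilter)unknown;
        IntPtr buffer = Marshal.AllocCoTaskMem((int)BufferChars * 2);
        try
        {
            uint initResult;
            Marshal.ThrowExceptionForHR(filter.Init(InitFlags, 0, IntPtr.Zero, out initResult));
            StringBuilder text = new StringBuilder();
            STAT_CHUNK chunk;
            // Any non-zero result (including FILTER_E_END_OF_CHUNKS) ends the loop.
            while (filter.GetChunk(out chunk) == 0)
            {
                if ((chunk.flags & CHUNK_TEXT) == 0) continue; // skip value chunks
                while (true)
                {
                    uint count = BufferChars;
                    int hr = filter.GetText(ref count, buffer);
                    if (hr < 0) break; // FILTER_E_NO_MORE_TEXT or a failure
                    text.Append(Marshal.PtrToStringUni(buffer, (int)count));
                    if (hr == FILTER_S_LAST_TEXT) break;
                }
                text.Append(' ');
            }
            return text.ToString();
        }
        finally
        {
            Marshal.FreeCoTaskMem(buffer);
            Marshal.ReleaseComObject(filter);
        }
    }
}
```

Calling FilterTextReader.ExtractText with the path to a .pdf file should return whatever text the installed PDF filter exposes, which you can then search with ordinary string methods.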

Microsoft MVP, Visual C#

Philip Fitzsimons

Adobe provides an API that allows you to do this. It's called the Adobe Search API, and it can be found at http://partners.adobe.com/asn/acrobat/docs.jsp

"When the only tool you have is a hammer, a sore thumb you will have."