URGENT: Help with parsing a PDF generated by Crystal Reports 9

vinoo80

Hi,

I am trying to parse the contents of a PDF with iTextSharp:

PdfReader reader = new PdfReader("Test.pdf");
byte[] pageContentByteArray = reader.GetPageContent(pageNumber);

I search this byte array for a particular text, based on a delimiter pattern, after converting it to a string:

string test = Encoding.ASCII.GetString(pageContentByteArray);

The required text pattern can be matched inside this string, and the above logic works absolutely fine with a normal PDF input file.

My requirement is to read a PDF file created by Crystal Reports (version 9). I get a byte array, and I tried converting it to a string using ASCII, Unicode, and UTF-8:

string test = Encoding.ASCII.GetString(invoicePageContentByteArray);
string test = Encoding.Unicode.GetString(invoicePageContentByteArray);
string test = Encoding.UTF8.GetString(invoicePageContentByteArray);

I could not find the text pattern in the output string. I guess the PDF generated by Crystal Reports uses some other encoding format. (Note: we verified the template used by Crystal Reports to generate the PDF. The search delimiter pattern is defined as a Text object.)

Can anyone suggest ideas to resolve this problem?

Thanks,
Uma

leckey

1. Read the forum guidelines.
2. No one cares if it is urgent.
3. Use the code tags.

Blog link to be reinstated at a later date.

Paul Conrad

leckey wrote:

No one cares if it is urgent.

True, just mark as abuse afterwards.

"The clue train passed his station without stopping." - John Simmons / outlaw programmer "Real programmers just throw a bunch of 1s and 0s at the computer to see what sticks" - Pete O'Hanlon "Not only do you continue to babble nonsense, you can't even correctly remember the nonsense you babbled just minutes ago." - Rob Graham

leckey

I gave him a break since he is new.

Paul Conrad

Must be just me, being down with the flu today :sigh:

leckey

Ick. I know how that feels. If you have chest issues, I tried an old wives' treatment of Vicks VapoRub on the feet, covered with socks. It did seem to help some.

Paul Conrad

Not sure if it was a 24-hour flu thing or food poisoning. Whichever it was, Campbell's Chicken Noodle Soup and green tea seem to be doing the job :) Feeling better enough to go for chicken enchiladas for dinner.

vinoo80

Stop this abuse. I am looking for genuine answers.

Paul Conrad

vinoo80 wrote:

I am looking for genuine answers.

Good luck to you :)

Furty

Paul Conrad wrote:

vinoo80 wrote:

I am looking for genuine answers.

Good luck to you

Aye. Especially on the CodeProject forums it seems! What ever happened to this place?

Kythen

I don't think the text encoding is your problem. Based on a quick Google search, it looks like GetPageContent doesn't do text extraction for you; it just returns the uncompressed operator stream. You will need to get cozy with the PDF file format and parse those operators to extract the text.

You will also need heuristics to put the text back together, because text operators don't necessarily appear in the PDF file in the order in which the text is displayed. Even then it may not be possible to extract the text accurately.

Here's an example of how you'd miss the text with the method you're using now. Searching for "Test" with the following operators would fail:

(T) Tj
(e) Tj
(s) Tj
(t) Tj
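A search like that can still be made to work for the simple case above if the string operands are collected first. The following is only a rough sketch, not iTextSharp functionality: the class TjSketch and the method JoinShownText are made-up names, and it assumes the operator stream has already been decoded into a string, that every string is a plain literal in parentheses, and that the font maps bytes to characters one-to-one (which is often not the case).

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of iTextSharp.
public static class TjSketch
{
    // Matches a literal string operand followed by the Tj operator, e.g. "(T) Tj".
    // Simplification: no nested parentheses, no hex strings, no font encoding lookup.
    private static readonly Regex ShowText =
        new Regex(@"\(((?:\\.|[^\\()])*)\)\s*Tj", RegexOptions.Compiled);

    // Concatenates the operands of every Tj operator in stream order.
    public static string JoinShownText(string contentStream)
    {
        var sb = new StringBuilder();
        foreach (Match m in ShowText.Matches(contentStream))
        {
            // Unescape only \( \) and \\ ; octal and \n style escapes are left as-is.
            sb.Append(Regex.Replace(m.Groups[1].Value, @"\\([()\\])", "$1"));
        }
        return sb.ToString();
    }

    public static void Main()
    {
        string ops = "(T) Tj\n(e) Tj\n(s) Tj\n(t) Tj\n";
        Console.WriteLine(ops.Contains("Test"));                // False
        Console.WriteLine(JoinShownText(ops).Contains("Test")); // True
    }
}
```

Joining in stream order only helps when the operators happen to appear in reading order.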

And here's an example of where you'd probably never find the text no matter what you do:

1 0 0 1 100 0 Tm
[(t) -10 (s) -10 (e) -10 (T)] TJ

Operators like these can display "Test" when the positioning adjustments place each glyph to the left of the previous one (the numbers here are only illustrative), but the text you'd likely extract is "tseT".

And don't forget to parse the form resources as well. Some PDF file creators like hiding text in forms. And by forms I don't mean forms that you fill out. See the PDF spec for info on form resources.

PS: In the future, don't bother saying your question is "Urgent". No one cares, and it makes it more likely that your question will be ignored. I replied because it was a reasonable question and you showed that you at least made a little effort to figure it out yourself.
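To make the TJ example above concrete, here is a rough sketch of what a naive extractor sees. It is not iTextSharp functionality: TjArraySketch and ExtractInStreamOrder are made-up names, and the code assumes plain literal strings with no escapes and no "]" inside a string. It reads the strings in the order they appear in the stream and ignores the positioning numbers, which is exactly why the result comes out reversed.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of iTextSharp.
public static class TjArraySketch
{
    // A TJ operand: everything between '[' and ']' directly before the TJ operator.
    private static readonly Regex ShowArray =
        new Regex(@"\[(.*?)\]\s*TJ", RegexOptions.Singleline);

    // A literal string inside the array (nested parentheses and escapes not handled).
    private static readonly Regex Literal = new Regex(@"\(([^()\\]*)\)");

    // Joins the strings of every TJ array in stream order, ignoring the
    // numeric adjustments between them.
    public static string ExtractInStreamOrder(string contentStream)
    {
        var sb = new StringBuilder();
        foreach (Match array in ShowArray.Matches(contentStream))
        {
            foreach (Match s in Literal.Matches(array.Groups[1].Value))
            {
                sb.Append(s.Groups[1].Value);
            }
        }
        return sb.ToString();
    }

    public static void Main()
    {
        string ops = "1 0 0 1 100 0 Tm\n[(t) -10 (s) -10 (e) -10 (T)] TJ\n";
        Console.WriteLine(ExtractInStreamOrder(ops)); // tseT
    }
}
```

Getting "Test" back would mean tracking the text matrix and glyph widths for each string and sorting by position, which is where the heuristics come in.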

vinoo80

Thanks a lot for the valuable input. Let me take it from here, and I will report back if I find a solution.