URGENT : Help with parsing the PDF generated by Crystal reports-V9

    vinoo80 wrote (#1):

    Hi,

    I am trying to parse the contents of a PDF with iTextSharp:

    PdfReader reader = new PdfReader("Test.pdf");
    byte[] pageContentByteArray = reader.GetPageContent(pageNumber);

    I search this byte array for a particular text, based on a delimiter pattern, after converting it to a string:

    string test = Encoding.ASCII.GetString(pageContentByteArray);

    The required text pattern can be matched inside this string, and this logic works fine with a normal PDF input file.

    My requirement is to read a PDF file created by Crystal Reports (version 9). I get a byte array for that file as well, and I tried converting it to a string with ASCII, Unicode, and UTF-8 in turn:

    string test = Encoding.ASCII.GetString(invoicePageContentByteArray);
    string test = Encoding.Unicode.GetString(invoicePageContentByteArray);
    string test = Encoding.UTF8.GetString(invoicePageContentByteArray);

    In none of these cases could I find the text pattern in the output string. I guess the PDF generated by Crystal Reports uses some other encoding.

    (Note: we verified the template Crystal Reports uses to generate the PDF. The search delimiter pattern is defined as a Text object.)

    Can anyone suggest ideas to resolve this problem?

    Thanks,
    Uma

      leckey 0 wrote (#2):

      1. Read the forum guidelines. 2. No one cares if it is urgent. 3. Use the code tags.

      Blog link to be reinstated at a later date.

        Paul Conrad wrote (#3):

        leckey wrote:

        No one cares if it is urgent.

        True, just mark as abuse afterwards.

        "The clue train passed his station without stopping." - John Simmons / outlaw programmer "Real programmers just throw a bunch of 1s and 0s at the computer to see what sticks" - Pete O'Hanlon "Not only do you continue to babble nonsense, you can't even correctly remember the nonsense you babbled just minutes ago." - Rob Graham

          leckey 0 wrote (#4):

          I gave him a break since he is new.

            Paul Conrad wrote (#5):

            Must be just me, being down with the flu today :sigh:

              leckey 0 wrote (#6):

              Ick. I know how that feels. If you have chest issues, I tried an old wives' treatment: Vicks VapoRub on the feet, then socks. It did seem to help some.

                Paul Conrad wrote (#7):

                Not sure if it was a 24-hour flu thing or food poisoning. Whichever it was, Campbell's Chicken Noodle Soup and green tea seem to be doing the job :) Feeling well enough to go for chicken enchiladas for dinner.

                  vinoo80 wrote (#8):

                  Stop this abuse. I am looking for genuine answers.

                    Paul Conrad wrote (#9):

                    vinoo80 wrote:

                    I am looking for genuine answers.

                    Good luck to you :)

                      Furty wrote (#10):

                      Paul Conrad wrote:

                      vinoo80 wrote: I am looking for genuine answers. Good luck to you

                      Aye. Especially on the CodeProject forums it seems! What ever happened to this place?

                        Kythen wrote (#11):

                        I don't think the text encoding is your problem. Based on a quick Google search, it looks like GetPageContent doesn't do text extraction for you; it just returns the uncompressed operator stream. You will need to get cozy with the PDF file format and parse those operators to extract the text from them.

                        You will also need heuristics to put the text back together, because text operators don't necessarily appear in the PDF file in the order in which the text is displayed. Even then it may not be possible to extract the text accurately.

                        Here's an example of how you'd miss the text with the method you're using now. Searching for "Test" with the following operators would fail:

                        (T) Tj
                        (e) Tj
                        (s) Tj
                        (t) Tj
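
The failure described above can be reproduced without any PDF library. The following is a minimal C# sketch, not iTextSharp code: the class `TjSketch` and the method `JoinShownText` are hypothetical names, and the regular expression assumes simple literal strings with no nested parentheses, escape sequences, or hex strings. It only shows why a plain substring search misses text that is split across several Tj operators.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class TjSketch
{
    // Matches a simple literal string operand followed by the Tj operator.
    // Assumption: no nested parentheses, escape sequences, or hex strings.
    private static readonly Regex ShowText = new Regex(@"\(([^()\\]*)\)\s*Tj");

    // Concatenates the operand of every Tj operator, in stream order.
    public static string JoinShownText(string contentStream)
    {
        var sb = new StringBuilder();
        foreach (Match m in ShowText.Matches(contentStream))
        {
            sb.Append(m.Groups[1].Value);
        }
        return sb.ToString();
    }

    public static void Main()
    {
        string stream = "(T) Tj\n(e) Tj\n(s) Tj\n(t) Tj\n";

        // Searching the raw operator stream finds nothing.
        Console.WriteLine(stream.Contains("Test"));                // False

        // Searching the joined operands does.
        Console.WriteLine(JoinShownText(stream).Contains("Test")); // True
    }
}
```

A real content stream also involves font encodings and positioning operators, so this is only a starting point for the kind of operator parsing suggested in the reply.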

                        And here's an example of where you'd probably never find the text no matter what you do:

                        1 0 0 1 100 0 Tm
                        [(t) -10 (s) -10 (e) -10 (T)] TJ

                        Operators like these can display "Test" while the text you'd likely extract is "tseT": the numbers in a TJ array shift the next glyph (a positive value moves it left, a negative one right), so with large enough positive values in place of the -10s each letter is drawn to the left of the previous one.

                        And don't forget to parse the form resources (form XObjects) as well. Some PDF creators like hiding text in forms, and by forms I don't mean forms that you fill out. See the PDF spec for info on form XObjects.

                        PS: In the future, don't bother saying your question is "Urgent". No one cares, and it makes your question more likely to be ignored. I replied because it was a reasonable question and you showed that you had at least made a little effort to figure it out yourself.
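
The ordering problem with the TJ example can be sketched the same way. This is a rough C# illustration under stated assumptions, not a text extractor: `TjArraySketch` and `TextInDrawOrder` are hypothetical names, the regular expressions assume simple literal strings (no escapes, nested parentheses, hex strings, or `]` inside a string), and the numeric adjustments and the Tm matrix are ignored entirely.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class TjArraySketch
{
    // A TJ operand is an array mixing strings and numeric position adjustments.
    private static readonly Regex ShowArray = new Regex(@"\[([^\]]*)\]\s*TJ");

    // Simple literal string inside the array (assumption: no escapes or nesting).
    private static readonly Regex Literal = new Regex(@"\(([^()\\]*)\)");

    // Returns the strings of every TJ array in the order they are drawn,
    // which is not necessarily the order in which they are read on the page.
    public static string TextInDrawOrder(string contentStream)
    {
        var sb = new StringBuilder();
        foreach (Match array in ShowArray.Matches(contentStream))
        {
            foreach (Match s in Literal.Matches(array.Groups[1].Value))
            {
                sb.Append(s.Groups[1].Value);
            }
        }
        return sb.ToString();
    }

    public static void Main()
    {
        string stream = "1 0 0 1 100 0 Tm\n[(t) -10 (s) -10 (e) -10 (T)] TJ\n";
        Console.WriteLine(TextInDrawOrder(stream)); // tseT
    }
}
```

Recovering reading order would mean tracking the text matrix and the adjustments to work out where each string lands on the page, then sorting by position. That is the heuristic part the reply warns about.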

                          vinoo80 wrote (#12):

                          Thanks a lot for the valuable input. I'll take it from here and report back if I find a solution.
