extract pdf images
-
Hello, I'm trying to extract images from a PDF. The problem is that some images in the PDF have the Predictor attribute. I think that, with this attribute, the resulting bytes have to be decoded with the PNG algorithm, but I can't find it. Another problem is with 8-bit indexed images: in this PDF file I can't find the palette, as I can in other PDF files. Is there someone who is an expert on PDF images? :) I hope my question is understandable. Thank you.
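On the Predictor question: in the PDF specification, /Predictor values 10 through 15 in /DecodeParms do indicate PNG row filtering applied before Flate compression. After inflating the stream, every row starts with one filter-type byte (0 None, 1 Sub, 2 Up, 3 Average, 4 Paeth) that has to be undone. Below is a minimal C# sketch of that one step, not a complete extractor. The class and parameter names are hypothetical, and it assumes the stream is already decompressed and that colors, bitsPerComponent and columns were read from /Colors, /BitsPerComponent and /Columns. As for the indexed images, the palette is the lookup entry of the /Indexed color space array; it may be an indirect reference to a separate stream object rather than an inline string, which could be why it seems to be missing.

```csharp
using System;

static class PngPredictor
{
    // Undoes PNG row filtering (PDF /Predictor 10..15) on already-inflated data.
    // Each input row is one filter-type byte followed by rowBytes of filtered data.
    public static byte[] Decode(byte[] data, int colors, int bitsPerComponent, int columns)
    {
        int bpp = (colors * bitsPerComponent + 7) / 8;            // bytes per pixel, rounded up
        int rowBytes = (colors * bitsPerComponent * columns + 7) / 8;
        int rows = data.Length / (rowBytes + 1);
        byte[] result = new byte[rows * rowBytes];

        for (int r = 0; r < rows; r++)
        {
            int src = r * (rowBytes + 1);
            int filter = data[src++];                             // per-row filter type
            int dst = r * rowBytes;
            for (int i = 0; i < rowBytes; i++)
            {
                int a = i >= bpp ? result[dst + i - bpp] : 0;     // byte to the left
                int b = r > 0 ? result[dst + i - rowBytes] : 0;   // byte above
                int c = (r > 0 && i >= bpp) ? result[dst + i - rowBytes - bpp] : 0; // upper left
                int pred;
                switch (filter)
                {
                    case 0: pred = 0; break;                      // None
                    case 1: pred = a; break;                      // Sub
                    case 2: pred = b; break;                      // Up
                    case 3: pred = (a + b) / 2; break;            // Average
                    case 4: pred = Paeth(a, b, c); break;         // Paeth
                    default: throw new InvalidOperationException("Unknown PNG filter type " + filter);
                }
                result[dst + i] = unchecked((byte)(data[src + i] + pred));
            }
        }
        return result;
    }

    // Paeth predictor as defined by the PNG specification.
    static int Paeth(int a, int b, int c)
    {
        int p = a + b - c;
        int pa = Math.Abs(p - a), pb = Math.Abs(p - b), pc = Math.Abs(p - c);
        if (pa <= pb && pa <= pc) return a;
        return pb <= pc ? b : c;
    }
}
```

For example, with one 8-bit component and three columns, the input bytes 1, 10, 5, 5, 2, 1, 1, 1 (a Sub row followed by an Up row) decode to 10, 15, 20, 11, 16, 21.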
-
Hi, you seem to be peeking into the PDF file to extract what you need, and that is bound to give you surprises when several different PDF sources are involved. There is another way to approach things that may or may not meet your needs: using something like Ghostscript, you can get an image of a particular page of a PDF document (in any resolution you want), then operate on that. The advantage is that you don't need to deal with all the possible ways PDF stores or creates images; the disadvantages are (1) you only get the resolution you asked for and (2) if all you need is a picture, you still have to locate it and extract it from the page image. Hope this helps. :)
Luc Pattyn [My Articles] Nil Volentibus Arduum
-
How do I use Ghostscript? I have installed it, but I don't understand how to run it.
-
Ghostscript can be run in a separate process; it takes its parameters from the command line. Like many programs, it explains its parameters when you run it with a /h, -h or -help argument in a command window. Here is a method I once used; you would have to adapt it to your needs, of course.
using System.Diagnostics;
using System.Drawing;
using System.IO;
...
string toolsFolder=@"...";
// withDebug: a bool field defined elsewhere in the class (not shown in this post)
...
// Renders one page of the PDF inName to the PNG file pageName and loads it as a Bitmap.
private Bitmap getPngImageFromPDF(int resolution, int pageNumber, string pageName, string inName) {
    // File names are quoted so that paths containing spaces survive the command line.
    string args=" -dSAFER -dBATCH -dNOPAUSE -sDEVICE=png16m -dTextAlphaBits=4 "+
        "-r" + resolution + " -sPageList=" + pageNumber +
        " \"-sOutputFile=" + pageName + "\" \"" + inName + "\"";
    string cmd=Path.Combine(toolsFolder, "gswin32c.exe");
    ProcessStartInfo psi=new ProcessStartInfo(cmd, args);
    psi.CreateNoWindow=true;
    if (!withDebug) psi.WindowStyle=ProcessWindowStyle.Hidden;
    // Wait for Ghostscript to finish before trying to read its output file.
    using (Process proc=Process.Start(psi)) {
        proc.WaitForExit();
    }
    Bitmap bm=(Bitmap)Image.FromFile(pageName);
    return bm;
}
Notes:
1. pageName is the name of the file that will be generated by Ghostscript.
2. The generated file will be locked as long as the generated bitmap is alive.
3. You should Dispose() of the bitmap when you no longer need it. :)
Luc Pattyn [My Articles] Nil Volentibus Arduum
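If the file lock mentioned in note 2 gets in the way (for instance when the PNG should be deleted or overwritten right away), one common workaround is to copy the loaded image into a second, in-memory Bitmap and dispose of the file-backed one immediately. This is only a sketch; the class and method names are hypothetical, and the copy may end up with a different pixel format or DPI setting than the file.

```csharp
using System.Drawing;

static class PageImageLoader
{
    // Loads an image file into an independent Bitmap so the file is not kept locked.
    public static Bitmap LoadUnlocked(string path)
    {
        // The file stays locked only while fromFile is alive, i.e. inside this block.
        using (Image fromFile = Image.FromFile(path))
        {
            // new Bitmap(Image) copies the pixels into a new in-memory bitmap.
            return new Bitmap(fromFile);
        }
    }
}
```

The caller still owns the returned bitmap, so note 3 applies unchanged: wrap it in a using block or call Dispose() when done.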
-
I am not able to generate the PNG file. Should the file have been generated in the same folder as the PDF files?
-
If you have trouble with Ghostscript, run it manually first, i.e. from inside a Command Prompt (you can copy/paste command lines into it), and look at what it tells you. Only when you're satisfied should you start using C# code and the Process class. The easiest way to get the folder issues sorted is to put the Ghostscript exe and your C# exe in the same folder (say "Debug"); I expect Ghostscript will then use paths relative to that ("Debug") folder. NB: there is nothing to install about Ghostscript; you can put the exe anywhere, as long as it can be found (e.g. because its folder is added to the PATH environment variable, or because it is in your "current directory"). Alternatively, you can tell the Process class where it is; that is what my toolsFolder did. Warning: if you specify a full or partial path for the output, you probably must make sure the folder exists before running Ghostscript. And the output location must be writable, so special folders such as C:\Program Files\ are a no-no. I can't help you any further; it is all standard Windows behavior as far as I know. :)
Luc Pattyn [My Articles] Nil Volentibus Arduum
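Once the command works by hand, the same "look at what it tells you" advice can be carried over into C#. The sketch below creates the output folder up front and captures whatever the tool prints, so a failed run can be diagnosed from code. The class, method and parameter names are hypothetical, and nothing in it is specific to Ghostscript: it runs whatever executable and arguments it is given.

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Text;

static class ToolRunner
{
    // Runs a console program, returning its exit code and everything it printed in log.
    public static int Run(string exePath, string args, string outputFile, out string log)
    {
        // Make sure the folder for the output file exists before the tool starts.
        string folder = Path.GetDirectoryName(Path.GetFullPath(outputFile));
        if (!string.IsNullOrEmpty(folder)) Directory.CreateDirectory(folder);

        ProcessStartInfo psi = new ProcessStartInfo(exePath, args);
        psi.UseShellExecute = false;            // required for redirecting the streams
        psi.RedirectStandardOutput = true;
        psi.RedirectStandardError = true;
        psi.CreateNoWindow = true;

        StringBuilder errors = new StringBuilder();
        using (Process proc = new Process())
        {
            proc.StartInfo = psi;
            // Collect stderr asynchronously so a full pipe cannot block the child process.
            proc.ErrorDataReceived += (sender, e) => { if (e.Data != null) errors.AppendLine(e.Data); };
            proc.Start();
            proc.BeginErrorReadLine();
            string output = proc.StandardOutput.ReadToEnd();
            proc.WaitForExit();
            log = output + errors.ToString();
            return proc.ExitCode;
        }
    }
}
```

A non-zero exit code, or an error message in log, is usually a quicker hint than a missing PNG file.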