Java programs: Extract images from PDF

Wednesday, June 26, 2013

Extract images from PDF

PDFImageExtractor is a simple program that can extract all images on a PDF document. Sometimes, we don't want to convert PDF pages to image files. We only want to take all images from each page. In this scenario, PDFImageExtractor is useful. It is easy to use. When the program runs, its allows you select a PDF file that contains images to be extracted out. The extracted image files are stored in your current working directory.

PDFImageExtractor source code

import java.awt.image.BufferedImage;
import java.io.File;
import javax.imageio.ImageIO;
import com.itextpdf.text.pdf.PRStream;
import com.itextpdf.text.pdf.PdfName;
import com.itextpdf.text.pdf.PdfObject;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfImageObject;
import javax.swing.filechooser.FileNameExtensionFilter;
import javax.swing.JFileChooser;

public class PDFImageExtractor{
public static void main(String[] args){

selectPDF();
}

//allow pdf file selection for extracting
public static void selectPDF(){

JFileChooser chooser = new JFileChooser();
FileNameExtensionFilter filter = new FileNameExtensionFilter("PDF","pdf");
chooser.setFileFilter(filter);
chooser.setMultiSelectionEnabled(false);
int returnVal = chooser.showOpenDialog(null);
if(returnVal == JFileChooser.APPROVE_OPTION) {
File file=chooser.getSelectedFile();
System.out.println("Please wait...");
extractImage(file.toString());
System.out.println("Extraction complete");
}

}

public static void extractImage(String src){

try{

//create pdf reader object
PdfReader pr=new PdfReader(src);
PRStream pst;
PdfImageObject pio;
PdfObject po;
int n=pr.getXrefSize(); //number of objects in pdf document
for(int i=0;i<n;i++){
po=pr.getPdfObject(i); //get the object at the index i in the objects collection
if(po==null || !po.isStream()) //object not found so continue
continue;
pst=(PRStream)po; //cast object to stream
PdfObject type=pst.get(PdfName.SUBTYPE); //get the object type
//check if the object is the image type object
if(type!=null && type.toString().equals(PdfName.IMAGE.toString())){
pio=new PdfImageObject(pst); //get the image
BufferedImage bi=pio.getBufferedImage(); //convert the image to buffered image
ImageIO.write(bi, "jpg", new File("image"+i+".jpg")); //write the buffered image
//to local disk

}

}

}catch(Exception e){e.printStackTrace();}

}

}

In the example code above, the getPdfObject(int index) is used to extract an object from the pdf document at the specified index. To determine whether the object is an image, you need to get the type of the object by using the get method of the stream created from the object.

Note: When you use this program to extract the images from the PDF document, some images might be in wrong order (different from what you see on the PDF pages). It is the problem from iText library itself. I tried to solve this problem with PdfBox. However, it can not be solved.