Wednesday, June 26, 2013

Extract images from PDF

PDFImageExtractor is a simple program that extracts all the images in a PDF document. Sometimes we don't want to convert whole PDF pages to image files; we only want the images embedded in the pages. In this scenario, PDFImageExtractor is useful, and it is easy to use. When the program runs, it lets you select the PDF file that contains the images to extract. The extracted image files are stored in your current working directory.
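To make the output location concrete: the program names each file image<N>.jpg, where N is the number of the PDF object the image came from, and creates it with a relative path. The following is a minimal sketch of how such a name resolves against the working directory. The class ImageOutputPaths and its methods are hypothetical helpers for illustration only, and the inDirectory variation is an assumption that is not part of the original program.

```java
import java.io.File;

// Hypothetical helper, not part of the original PDFImageExtractor.
final class ImageOutputPaths {

    // Same naming scheme as the program: "image" + object number + ".jpg".
    static String fileName(int objectNumber) {
        return "image" + objectNumber + ".jpg";
    }

    // A relative File resolves against the directory the JVM was started in.
    static File inWorkingDirectory(int objectNumber) {
        return new File(fileName(objectNumber)).getAbsoluteFile();
    }

    // Variation (an assumption, not in the original): target a chosen directory.
    static File inDirectory(File directory, int objectNumber) {
        return new File(directory, fileName(objectNumber));
    }

    public static void main(String[] args) {
        // Print where the image taken from object 12 would be written.
        System.out.println(inWorkingDirectory(12));
        System.out.println(inDirectory(new File("extracted"), 12));
    }
}
```

If you would rather keep the images out of the working directory, something like inDirectory could supply the File passed to ImageIO.write, provided the directory already exists.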


PDFImageExtractor source code

import java.awt.image.BufferedImage;
import java.io.File;
import javax.imageio.ImageIO;
import com.itextpdf.text.pdf.PRStream;
import com.itextpdf.text.pdf.PdfName;
import com.itextpdf.text.pdf.PdfObject;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfImageObject;
import javax.swing.filechooser.FileNameExtensionFilter;
import javax.swing.JFileChooser;

public class PDFImageExtractor {

    public static void main(String[] args) {
        selectPDF();
    }

    // Let the user pick the PDF file whose images will be extracted
    public static void selectPDF() {
        JFileChooser chooser = new JFileChooser();
        FileNameExtensionFilter filter = new FileNameExtensionFilter("PDF", "pdf");
        chooser.setFileFilter(filter);
        chooser.setMultiSelectionEnabled(false);
        int returnVal = chooser.showOpenDialog(null);
        if (returnVal == JFileChooser.APPROVE_OPTION) {
            File file = chooser.getSelectedFile();
            System.out.println("Please wait...");
            extractImage(file.toString());
            System.out.println("Extraction complete");
        }
    }


    public static void extractImage(String src) {
        PdfReader pr = null;
        try {
            // Create the PDF reader object
            pr = new PdfReader(src);
            int n = pr.getXrefSize(); // number of objects in the PDF document
            for (int i = 0; i < n; i++) {
                PdfObject po = pr.getPdfObject(i); // get the object with number i
                if (po == null || !po.isStream()) // missing or not a stream, so skip it
                    continue;
                PRStream pst = (PRStream) po; // cast the object to a stream
                PdfObject type = pst.get(PdfName.SUBTYPE); // get the stream's subtype
                // Check whether the stream is an image
                if (PdfName.IMAGE.equals(type)) {
                    PdfImageObject pio = new PdfImageObject(pst); // get the image
                    BufferedImage bi = pio.getBufferedImage(); // decode it to a buffered image
                    if (bi == null) // the image could not be decoded, so skip it
                        continue;
                    // Write the buffered image to local disk
                    ImageIO.write(bi, "jpg", new File("image" + i + ".jpg"));
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            if (pr != null)
                pr.close(); // release the PDF file
        }
    }


}

In the example code above, getPdfObject(int index) retrieves the object with the given number from the PDF document. To determine whether the object is an image, cast it to a stream and read its subtype with the stream's get method, get(PdfName.SUBTYPE). For an image, the subtype is PdfName.IMAGE.
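One caveat about the ImageIO.write(bi, "jpg", ...) call in the code above: JPEG has no alpha channel, and depending on the JDK version the built-in JPEG writer may either refuse an image that has transparency (ImageIO.write then returns false and no file is written) or write it with distorted colors. The following is a minimal sketch, assuming a white background is acceptable, of flattening an image to plain RGB before writing it. The class JpegSafeWriter and its methods toRgb and writeJpeg are hypothetical names and are not part of the original program or of iText.

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

// Hypothetical helper, not part of the original PDFImageExtractor.
final class JpegSafeWriter {

    // Returns an opaque RGB copy of the image, painted over a white background.
    static BufferedImage toRgb(BufferedImage src) {
        BufferedImage rgb = new BufferedImage(
                src.getWidth(), src.getHeight(), BufferedImage.TYPE_INT_RGB);
        Graphics2D g = rgb.createGraphics();
        try {
            g.setColor(Color.WHITE);
            g.fillRect(0, 0, rgb.getWidth(), rgb.getHeight());
            g.drawImage(src, 0, 0, null);
        } finally {
            g.dispose();
        }
        return rgb;
    }

    // Writes the flattened image as JPEG; returns false if no writer was found.
    static boolean writeJpeg(BufferedImage image, File target) throws IOException {
        return ImageIO.write(toRgb(image), "jpg", target);
    }

    public static void main(String[] args) throws IOException {
        // Encode a fully transparent test image into memory instead of a file.
        BufferedImage transparent = new BufferedImage(8, 8, BufferedImage.TYPE_INT_ARGB);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        boolean written = ImageIO.write(toRgb(transparent), "jpg", out);
        System.out.println("written=" + written + ", bytes=" + out.size());
    }
}
```

In extractImage, the ImageIO.write call could then be swapped for JpegSafeWriter.writeJpeg(bi, new File("image" + i + ".jpg")), with a false return value reported to the user. Writing "png" instead of "jpg" would be another way to keep transparency, at the cost of larger files.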

Note: When you use this program to extract the images from a PDF document, some images might come out in a different order from the one you see on the PDF pages. The program visits objects by their number in the cross-reference table, and that numbering does not have to follow page order. I tried to solve this problem with PDFBox, but I was not able to.

8 comments:

  1. how to extract images containing fonts such as formula

  2. I'm not a developer, i always use this free online tool to extract images from pdf

  3. Looks kinda difficult... Wonder if I should go to this website or anything similar to learn to code or at least write good papers...

  4. Really enjoyed this article post. Really looking forward to read more. Will read on... visit this website

  5. The PDF standard grants individuals in various areas to chip away at similar archives. https://www.altoconvertpdftojpg.com/faq

  6. Additionally, someone at one point needed to join at least two PDF records into a solitary document. https://altoconvertjpgtopdf.com/about-us
