PDFBox – How to read PDF file in Java

This article shows you how to use Apache PDFBox to read a PDF file in Java.

1. Get PDFBox

pom.xml

<dependency>
        <groupId>org.apache.pdfbox</groupId>
        <artifactId>pdfbox</artifactId>
        <version>2.0.6</version>
</dependency>

2. Print PDF file

This example extracts all text from a PDF file and prints it line by line.

ReadPdf.java

package com.mkyong;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

import java.io.File;
import java.io.IOException;

public class ReadPdf {

    public static void main(String[] args) throws IOException {

        // try-with-resources closes the document automatically
        try (PDDocument document = PDDocument.load(new File("/path-to/abc.pdf"))) {

            // this example only reads documents that are not encrypted
            if (!document.isEncrypted()) {

                PDFTextStripper tStripper = new PDFTextStripper();

                // extract the text of every page as a single String
                String pdfFileInText = tStripper.getText(document);

                // split by line breaks (Windows \r\n or Unix \n)
                String[] lines = pdfFileInText.split("\\r?\\n");
                for (String line : lines) {
                    System.out.println(line);
                }

            }

        }

    }
}
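
The regular expression passed to split in ReadPdf.java, "\\r?\\n", matches line breaks rather than general whitespace. The sketch below uses plain Java only (no PDFBox) to show how that pattern behaves on mixed Windows and Unix line endings. The class LineSplitDemo and the method splitLines are hypothetical names introduced for illustration, and the sample string is made up.

```java
import java.util.Arrays;

// Hypothetical helper for illustration; not part of PDFBox or the original example.
final class LineSplitDemo {

    // Splits text on Unix (\n) or Windows (\r\n) line endings.
    // String.split drops trailing empty strings, so blank lines at the
    // very end of the text disappear, while blank lines in the middle stay.
    static String[] splitLines(String text) {
        return text.split("\\r?\\n");
    }

    public static void main(String[] args) {
        // Made-up sample mixing both line-ending styles, with trailing blank lines
        String sample = "Title\r\nfirst line\nsecond line\n\n";
        String[] lines = splitLines(sample);
        System.out.println(lines.length);            // prints 3
        System.out.println(Arrays.toString(lines));  // prints [Title, first line, second line]
    }
}
```

If trailing blank lines matter, text.split("\\r?\\n", -1) keeps them, because a negative limit tells String.split not to discard trailing empty strings.
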
Note
For more examples, refer to the PDFBox SVN repository.


mkyong


Comments

Ankita Tapadia

Hi Mkyong. I am getting the following error with the same code. Could you please guide me to a resolution?

org.codehaus.groovy.control.MultipleCompilationErrorsException: startup failed:
Script1.groovy: 10: unable to resolve class org.apache.pdfbox.text.PDFTextStripper
@ line 10, column 3.
import org.apache.pdfbox.text.PDFTextStripper;
^

1 error

at org.webharvest.runtime.scripting.GroovyScriptEngine.eval(GroovyScriptEngine.java:138) ~[workfusion-webharvest-core.jar:na]
at org.webharvest.runtime.processors.ScriptProcessor.execute(ScriptProcessor.java:74) ~[workfusion-webharvest-core.jar:na]
at org.webharvest.runtime.processors.BaseProcessor.run(BaseProcessor.java:127) ~[workfusion-webharvest-core.jar:na]
at org.webharvest.runtime.Scraper.execute(Scraper.java:169) ~[workfusion-webharvest-core.jar:na]
at org.webharvest.runtime.Scraper.execute(Scraper.java:182) ~[workfusion-webharvest-core.jar:na]
at com.workfusion.studio.rpa.wf.webharvest.launch.StudioWebHarvestTaskExecutor.execute(StudioWebHarvestTaskExecutor.java:108) ~[com.workfusion.studio.wf_8.4.0.jar:na]
at com.workfusion.studio.rpa.wf.webharvest.launch.SingleThreadWebHarvestProcess.processTaskInputs(SingleThreadWebHarvestProcess.java:75) ~[com.workfusion.studio.wf_8.4.0.jar:na]
at com.workfusion.studio.rpa.wf.webharvest.launch.SingleThreadWebHarvestProcess.start(SingleThreadWebHarvestProcess.java:44) ~[com.workfusion.studio.wf_8.4.0.jar:na]
at com.workfusion.studio.rpa.wf.webharvest.launch.WebHarvestMainLauncher.launch(WebHarvestMainLauncher.java:83) ~[com.workfusion.studio.wf_8.4.0.jar:na]
at com.workfusion.studio.rpa.wf.webharvest.launch.WebHarvestMainLauncher.main(WebHarvestMainLauncher.java:141) ~[com.workfusion.studio.wf_8.4.0.jar:na]
Caused by: org.codehaus.groovy.control.MultipleCompilationErrorsException: startup failed:

Manmaya

Hi,
Thanks for posting this. Is there any way to determine the font name and size for a particular line of text?

Rituja

I can’t read a column-wise PDF document. It reads the data row-wise. What do we have to do to read the data column-wise?

SHANKAR

Hi, how can I read a non-editable image PDF in Java without installing software?

Mohamad Basuki

Hi Bro, how do I read a multipage PDF? Thanks 🙂

Sid

Hi Mkyong, I have to convert a PDF file to HTML, and for this I need Java code that fetches the formatting of the PDF along with the text, for example tables, images, forms, etc. Please guide me.
Thanks.

Anonymous

https://www.baeldung.com/pdf-conversions-java

Refer to this one; it might be useful for you.

kumar

Thanks for the help

Smarty

Yes, it's really worth it.

Shafqat Shafi

Thanks a lot for such a neat and informative article.

srinivas

How can I get the font style for each line in a PDF using PDFBox?