How to read XML file in Java (SAX Parser)
This tutorial will show you how to use the Java built-in SAX parser to read and parse an XML file.
- 1. What is Simple API for XML (SAX)
- 2. Read or Parse a XML file (SAX)
- 3. Convert an XML file to an object
- 4. SAX Error Handler
- 5. SAX and Unicode
- 6. Download Source Code
- 7. References
1. What is Simple API for XML (SAX)
1.1 The Simple API for XML (SAX) is a push-based, event-driven API that follows the observer pattern and reads the elements of an XML file sequentially. The SAX parser reads the XML file from start to end; it calls one method when it encounters the start of an element, and different methods when it finds text or the end of an element. Attributes are passed along with the start-of-element call.
SAX is fast and efficient and requires much less memory than DOM, because SAX does not create an internal representation (tree structure) of the XML data, as DOM does.
Note
The SAX parser is generally faster and uses less memory than the DOM parser. SAX is suitable for reading XML elements sequentially; DOM is suitable for XML manipulation such as creating, modifying, or deleting XML elements.
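To make the sequential-access point concrete, here is a minimal sketch that counts elements while the document streams past, without keeping any tree in memory. The class name `SaxElementCounter` and the inline XML are illustrative assumptions, not part of the tutorial's source code.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: counts start tags with a given name in one pass.
class SaxElementCounter {

    static int count(String xml, String target) {
        int[] counter = {0}; // single-element array so the handler can update it
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attributes) {
                if (qName.equals(target)) {
                    counter[0]++;
                }
            }
        };
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Unable to parse XML", e);
        }
        return counter[0];
    }

    public static void main(String[] args) {
        String xml = "<Company><staff id=\"1001\"/><staff id=\"1002\"/></Company>";
        System.out.println(count(xml, "staff")); // prints 2
    }
}
```

Only the counter survives the parse; each element is seen once and then forgotten, which is why this style suits large files.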
1.2 Some common SAX events:
- startDocument() and endDocument(): methods called at the start and end of an XML document.
- startElement() and endElement(): methods called at the start and end of an XML element.
- characters(): method called with the text content between the start and end of an XML element.
1.3 Below is a simple XML file.
<name>mkyong</name>
The SAX parser reads the above XML and calls the following methods in order:
- startDocument()
- startElement() : <name>
- characters() : mkyong
- endElement() : </name>
- endDocument()
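The event order above can be observed directly. The following is a small sketch, assuming the JDK's built-in parser; the class name `SaxEventTrace` and the label strings are made up for illustration. Note that a parser is allowed to deliver text in more than one characters() call, so the number of characters() lines is not guaranteed in general.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: records every SAX callback in the order it fires.
class SaxEventTrace {

    static List<String> trace(String xml) {
        List<String> events = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startDocument() {
                events.add("startDocument()");
            }

            @Override
            public void endDocument() {
                events.add("endDocument()");
            }

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attributes) {
                events.add("startElement() <" + qName + ">");
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                events.add("endElement() </" + qName + ">");
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                events.add("characters() " + new String(ch, start, length));
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Unable to parse XML", e);
        }
        return events;
    }

    public static void main(String[] args) {
        // Prints one line per callback, from startDocument() to endDocument()
        trace("<name>mkyong</name>").forEach(System.out::println);
    }
}
```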
2. Read or Parse a XML file (SAX)
This example shows you how to use the Java built-in SAX parser APIs to read or parse an XML file.
2.1 Below is an XML file.
<?xml version="1.0" encoding="utf-8"?>
<Company>
<staff id="1001">
<name>mkyong</name>
<role>support</role>
<salary currency="USD">5000</salary>
<!-- for special characters like < &, need CDATA -->
<bio><![CDATA[HTML tag <code>testing</code>]]></bio>
</staff>
<staff id="1002">
<name>yflow</name>
<role>admin</role>
<salary currency="EUR">8000</salary>
<bio><![CDATA[a & b]]></bio>
</staff>
</Company>
P.S. In the XML file, special characters such as < or & must either be escaped (&lt; and &amp;) or wrapped in a CDATA section.
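CDATA is one way to carry special characters; entity escapes such as `&amp;` and `&lt;` are the other, and the handler receives the same text either way. The sketch below illustrates this under the assumption of the JDK's built-in parser; `EscapeVsCdata` and `textOf` are hypothetical names.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: returns all character data found in a document.
class EscapeVsCdata {

    static String textOf(String xml) {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length); // append: text may arrive in chunks
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Unable to parse XML", e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String escaped = "<bio>a &amp; b</bio>";
        String cdata = "<bio><![CDATA[a & b]]></bio>";
        System.out.println(textOf(escaped)); // a & b
        System.out.println(textOf(cdata));   // a & b
    }
}
```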
2.2 Create a class that extends org.xml.sax.helpers.DefaultHandler and overrides the startElement, endElement, and characters methods to print the XML elements, attributes, and text. (XML comments are not reported to a DefaultHandler.)
package com.mkyong.xml.sax.handler;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;
public class PrintAllHandlerSax extends DefaultHandler {
private StringBuilder currentValue = new StringBuilder();
@Override
public void startDocument() {
System.out.println("Start Document");
}
@Override
public void endDocument() {
System.out.println("End Document");
}
@Override
public void startElement(
String uri,
String localName,
String qName,
Attributes attributes) {
// reset the tag value
currentValue.setLength(0);
System.out.printf("Start Element : %s%n", qName);
if (qName.equalsIgnoreCase("staff")) {
// get tag's attribute by name
String id = attributes.getValue("id");
System.out.printf("Staff id : %s%n", id);
}
if (qName.equalsIgnoreCase("salary")) {
// get tag's attribute by index, 0 = first attribute
String currency = attributes.getValue(0);
System.out.printf("Currency :%s%n", currency);
}
}
@Override
public void endElement(String uri,
String localName,
String qName) {
System.out.printf("End Element : %s%n", qName);
if (qName.equalsIgnoreCase("name")) {
System.out.printf("Name : %s%n", currentValue.toString());
}
if (qName.equalsIgnoreCase("role")) {
System.out.printf("Role : %s%n", currentValue.toString());
}
if (qName.equalsIgnoreCase("salary")) {
System.out.printf("Salary : %s%n", currentValue.toString());
}
if (qName.equalsIgnoreCase("bio")) {
System.out.printf("Bio : %s%n", currentValue.toString());
}
}
// http://www.saxproject.org/apidoc/org/xml/sax/ContentHandler.html#characters%28char%5B%5D,%20int,%20int%29
// SAX parsers may return all contiguous character data in a single chunk,
// or they may split it into several chunks
@Override
public void characters(char[] ch, int start, int length) {
// The characters() method can be called multiple times for a single text node.
// Assigning each chunk to a new String would drop the earlier chunks,
// so avoid doing this:
// value = new String(ch, start, length);
// Appending is safer; it works for a single call or for several.
currentValue.append(ch, start, length);
}
}
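The reason for appending in characters() is easier to see when the chunking is simulated by hand. In this sketch the callbacks are invoked directly rather than by a parser, purely to make the split deterministic; `ChunkedCharactersDemo` and `TextHandler` are hypothetical names, and the two-chunk split is an assumption about what a parser might do.

```java
import org.xml.sax.helpers.DefaultHandler;

class ChunkedCharactersDemo {

    // Keeps both strategies side by side so they can be compared.
    static class TextHandler extends DefaultHandler {
        final StringBuilder appended = new StringBuilder();
        String overwritten = "";

        @Override
        public void characters(char[] ch, int start, int length) {
            appended.append(ch, start, length);          // keeps every chunk
            overwritten = new String(ch, start, length); // keeps only the last chunk
        }
    }

    public static void main(String[] args) {
        TextHandler handler = new TextHandler();

        // Simulate a parser delivering the text "mkyong" in two pieces.
        char[] first = "mky".toCharArray();
        char[] second = "ong".toCharArray();
        handler.characters(first, 0, first.length);
        handler.characters(second, 0, second.length);

        System.out.println(handler.appended);    // mkyong
        System.out.println(handler.overwritten); // ong
    }
}
```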
2.3 Use SAXParser to parse the XML file.
package com.mkyong.xml.sax;
import com.mkyong.xml.sax.handler.PrintAllHandlerSax;
import org.xml.sax.SAXException;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import java.io.IOException;
public class ReadXmlSaxParser {
private static final String FILENAME = "src/main/resources/staff.xml";
public static void main(String[] args) {
SAXParserFactory factory = SAXParserFactory.newInstance();
try {
// Default settings are vulnerable to XXE attacks, see https://rules.sonarsource.com/java/RSPEC-2755
SAXParser saxParser = factory.newSAXParser();
PrintAllHandlerSax handler = new PrintAllHandlerSax();
saxParser.parse(FILENAME, handler);
} catch (ParserConfigurationException | SAXException | IOException e) {
e.printStackTrace();
}
}
}
Output
Start Document
Start Element : Company
Start Element : staff
Staff id : 1001
Start Element : name
End Element : name
Name : mkyong
Start Element : role
End Element : role
Role : support
Start Element : salary
Currency :USD
End Element : salary
Salary : 5000
Start Element : bio
End Element : bio
Bio : HTML tag <code>testing</code>
End Element : staff
Start Element : staff
Staff id : 1002
Start Element : name
End Element : name
Name : yflow
Start Element : role
End Element : role
Role : admin
Start Element : salary
Currency :EUR
End Element : salary
Salary : 8000
Start Element : bio
End Element : bio
Bio : a & b
End Element : staff
End Element : Company
End Document
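2.4 With default settings, the SAX parser is vulnerable to XML external entity (XXE) attacks (CWE-611). The snippet below prevents them by disabling DOCTYPE declarations completely; see the XXE article in the references for more options.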
SAXParserFactory factory = SAXParserFactory.newInstance();
try {
// https://rules.sonarsource.com/java/RSPEC-2755
// prevent XXE, completely disable DOCTYPE declaration:
factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
SAXParser saxParser = factory.newSAXParser();
PrintAllHandlerSax handler = new PrintAllHandlerSax();
saxParser.parse(FILENAME, handler);
} catch (ParserConfigurationException | SAXException | IOException e) {
e.printStackTrace();
}
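To check that the feature has the intended effect, one can feed the hardened parser a document that contains a DOCTYPE and confirm that it is refused. This is a rough sketch assuming the JDK's built-in (Xerces-based) parser, which recognizes this feature URI; other implementations may not. The class name `DoctypeGuardDemo` and the sample documents are illustrative.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class DoctypeGuardDemo {

    // Returns true if the hardened parser refuses the document with a parse
    // error (a rejected DOCTYPE, or any other fatal parse error).
    static boolean isRejected(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            SAXParser parser = factory.newSAXParser();
            parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return false;
        } catch (SAXParseException e) {
            return true;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Unable to configure or run the parser", e);
        }
    }

    public static void main(String[] args) {
        String withDoctype = "<!DOCTYPE foo [<!ENTITY x \"y\">]><foo>&x;</foo>";
        String plain = "<foo>ok</foo>";
        System.out.println(isRejected(withDoctype)); // true
        System.out.println(isRejected(plain));       // false
    }
}
```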
3. Convert an XML file to an object
This example parses an XML file and converts it into a List of objects. It works, but for XML-to-object mapping it is not the recommended approach; try JAXB instead.
3.1 Review the same XML file.
<?xml version="1.0" encoding="utf-8"?>
<Company>
<staff id="1001">
<name>mkyong</name>
<role>support</role>
<salary currency="USD">5000</salary>
<!-- for special characters like < &, need CDATA -->
<bio><![CDATA[HTML tag <code>testing</code>]]></bio>
</staff>
<staff id="1002">
<name>yflow</name>
<role>admin</role>
<salary currency="EUR">8000</salary>
<bio><![CDATA[a & b]]></bio>
</staff>
</Company>
3.2 We want to convert each staff element in the above XML file into the following Staff object.
package com.mkyong.xml.sax.model;
import java.math.BigDecimal;
public class Staff {
private Long id;
private String name;
private String role;
private BigDecimal salary;
private String Currency;
private String bio;
//... getters, setters...toString
}
3.3 The class below does the XML-to-object conversion.
package com.mkyong.xml.sax.handler;
import com.mkyong.xml.sax.model.Staff;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;
public class MapStaffObjectHandlerSax extends DefaultHandler {
private StringBuilder currentValue = new StringBuilder();
List<Staff> result;
Staff currentStaff;
public List<Staff> getResult() {
return result;
}
@Override
public void startDocument() {
result = new ArrayList<>();
}
@Override
public void startElement(
String uri,
String localName,
String qName,
Attributes attributes) {
// reset the tag value
currentValue.setLength(0);
// start of loop
if (qName.equalsIgnoreCase("staff")) {
// new staff
currentStaff = new Staff();
// staff id
String id = attributes.getValue("id");
currentStaff.setId(Long.valueOf(id));
}
if (qName.equalsIgnoreCase("salary")) {
// salary currency
String currency = attributes.getValue("currency");
currentStaff.setCurrency(currency);
}
}
@Override
public void endElement(String uri,
String localName,
String qName) {
if (qName.equalsIgnoreCase("name")) {
currentStaff.setName(currentValue.toString());
}
if (qName.equalsIgnoreCase("role")) {
currentStaff.setRole(currentValue.toString());
}
if (qName.equalsIgnoreCase("salary")) {
currentStaff.setSalary(new BigDecimal(currentValue.toString()));
}
if (qName.equalsIgnoreCase("bio")) {
currentStaff.setBio(currentValue.toString());
}
// end of loop
if (qName.equalsIgnoreCase("staff")) {
result.add(currentStaff);
}
}
@Override
public void characters(char[] ch, int start, int length) {
currentValue.append(ch, start, length);
}
}
3.4 Run it.
package com.mkyong.xml.sax;
import com.mkyong.xml.sax.handler.MapStaffObjectHandlerSax;
import com.mkyong.xml.sax.model.Staff;
import org.xml.sax.SAXException;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import java.io.IOException;
import java.io.InputStream;
import java.util.List;
public class ReadXmlSaxParser2 {
public static void main(String[] args) {
SAXParserFactory factory = SAXParserFactory.newInstance();
try (InputStream is = getXMLFileAsStream()) {
SAXParser saxParser = factory.newSAXParser();
// parse XML and map it to objects; it works, but JAXB is the recommended approach
MapStaffObjectHandlerSax handler = new MapStaffObjectHandlerSax();
saxParser.parse(is, handler);
// print all
List<Staff> result = handler.getResult();
result.forEach(System.out::println);
} catch (ParserConfigurationException | SAXException | IOException e) {
e.printStackTrace();
}
}
// get XML file from resources folder.
private static InputStream getXMLFileAsStream() {
return ReadXmlSaxParser2.class.getClassLoader().getResourceAsStream("staff.xml");
}
}
Output
Staff{id=1001, name='mkyong', role='support', salary=5000, Currency='USD', bio='HTML tag <code>testing</code>'}
Staff{id=1002, name='yflow', role='admin', salary=8000, Currency='EUR', bio='a & b'}
4. SAX Error Handler
This example shows how to register a custom error handler for the SAX parser.
4.1 Create a class that implements org.xml.sax.ErrorHandler. The code is self-explanatory; it simply wraps the original error message.
package com.mkyong.xml.sax.handler;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import java.io.PrintStream;
public class CustomErrorHandlerSax implements ErrorHandler {
private PrintStream out;
public CustomErrorHandlerSax(PrintStream out) {
this.out = out;
}
private String getParseExceptionInfo(SAXParseException spe) {
String systemId = spe.getSystemId();
if (systemId == null) {
systemId = "null";
}
String info = "URI=" + systemId + " Line="
+ spe.getLineNumber() + ": " + spe.getMessage();
return info;
}
public void warning(SAXParseException spe) throws SAXException {
out.println("Warning: " + getParseExceptionInfo(spe));
}
public void error(SAXParseException spe) throws SAXException {
String message = "Error: " + getParseExceptionInfo(spe);
throw new SAXException(message);
}
public void fatalError(SAXParseException spe) throws SAXException {
String message = "Fatal Error: " + getParseExceptionInfo(spe);
throw new SAXException(message);
}
}
4.2 We use saxParser.getXMLReader() to get an org.xml.sax.XMLReader, which provides more options for configuring the SAX parser.
package com.mkyong.xml.sax;
import com.mkyong.xml.sax.handler.CustomErrorHandlerSax;
import com.mkyong.xml.sax.handler.MapStaffObjectHandlerSax;
import com.mkyong.xml.sax.model.Staff;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import java.io.IOException;
import java.io.InputStream;
import java.util.List;
public class ReadXmlSaxParser3 {
public static void main(String[] args) {
SAXParserFactory factory = SAXParserFactory.newInstance();
try (InputStream is = getXMLFileAsStream()) {
SAXParser saxParser = factory.newSAXParser();
// parse XML and map it to objects; it works, but JAXB is the recommended approach
MapStaffObjectHandlerSax handler = new MapStaffObjectHandlerSax();
// try XMLReader
//saxParser.parse(is, handler);
// more options for configuration
XMLReader xmlReader = saxParser.getXMLReader();
// set our custom error handler
xmlReader.setErrorHandler(new CustomErrorHandlerSax(System.err));
xmlReader.setContentHandler(handler);
InputSource source = new InputSource(is);
xmlReader.parse(source);
// print all
List<Staff> result = handler.getResult();
result.forEach(System.out::println);
} catch (ParserConfigurationException | SAXException | IOException e) {
e.printStackTrace();
}
}
// get XML file from resources folder.
private static InputStream getXMLFileAsStream() {
return ReadXmlSaxParser3.class.getClassLoader().getResourceAsStream("staff.xml");
}
}
4.3 Update staff.xml: remove the CDATA wrapper in the bio element and put a bare & in it. The SAX parser will then report an error.
<?xml version="1.0" encoding="utf-8"?>
<Company>
<staff id="1001">
<name>mkyong</name>
<role>support</role>
<salary currency="USD">5000</salary>
<!-- for special characters like < &, need CDATA -->
<bio>&</bio>
</staff>
</Company>
4.4 Run it with the above custom error handler.
xmlReader.setErrorHandler(new CustomErrorHandlerSax(System.err));
Output
org.xml.sax.SAXException: Fatal Error: URI=null Line=8: The entity name must immediately follow the '&' in the entity reference.
at com.mkyong.xml.sax.handler.CustomErrorHandlerSax.fatalError(CustomErrorHandlerSax.java:41)
at java.xml/com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.fatalError(ErrorHandlerWrapper.java:181)
at java.xml/com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:400)
at java.xml/com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:327)
at java.xml/com.sun.org.apache.xerces.internal.impl.XMLScanner.reportFatalError(XMLScanner.java:1471)
//...
4.5 Run it without a custom error handler.
// xmlReader.setErrorHandler(new CustomErrorHandlerSax(System.err));
Output
[Fatal Error] :8:15: The entity name must immediately follow the '&' in the entity reference.
org.xml.sax.SAXParseException; lineNumber: 8; columnNumber: 15; The entity name must immediately follow the '&' in the entity reference.
at java.xml/com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1243)
at java.xml/com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:635)
at com.mkyong.xml.sax.ReadXmlSaxParser2.main(ReadXmlSaxParser2.java:44)
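The same behaviour can be reproduced without editing staff.xml by parsing a malformed string. The sketch below is a reduced variant of the custom handler idea, not the tutorial's CustomErrorHandlerSax itself; the class name `ErrorHandlerDemo` and the message format are assumptions made for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class ErrorHandlerDemo {

    // Parses the XML and returns "OK", or the message produced by the error handler.
    static String parse(String xml) {
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(new DefaultHandler());
            reader.setErrorHandler(new ErrorHandler() {
                @Override
                public void warning(SAXParseException e) {
                    // warnings are ignored in this sketch
                }

                @Override
                public void error(SAXParseException e) throws SAXException {
                    throw new SAXException("Error: line " + e.getLineNumber() + ": " + e.getMessage());
                }

                @Override
                public void fatalError(SAXParseException e) throws SAXException {
                    throw new SAXException("Fatal Error: line " + e.getLineNumber() + ": " + e.getMessage());
                }
            });
            reader.parse(new InputSource(new StringReader(xml)));
            return "OK";
        } catch (SAXException e) {
            return e.getMessage();
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException("Unable to configure or run the parser", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("<bio>a &amp; b</bio>")); // OK
        System.out.println(parse("<bio>&</bio>"));         // message starting with "Fatal Error:"
    }
}
```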
5. SAX and Unicode
For XML files containing Unicode characters, SAX follows the encoding declared in the XML file (UTF-8 by default) and parses the content correctly.
5.1 We can declare the encoding at the top of the XML file with encoding="encoding-code"; for example, the XML file below uses the UTF-8 encoding.
<?xml version="1.0" encoding="utf-8"?>
<Company>
<staff id="1001">
<name>揚木金</name>
<role>support</role>
<salary currency="USD">5000</salary>
<bio><![CDATA[HTML tag <code>testing</code>]]></bio>
</staff>
<staff id="1002">
<name>yflow</name>
<role>admin</role>
<salary currency="EUR">8000</salary>
<bio><![CDATA[a & b]]></bio>
</staff>
</Company>
5.2 Alternatively, we can set the encoding explicitly on the InputSource.
XMLReader xmlReader = saxParser.getXMLReader();
xmlReader.setContentHandler(handler);
InputSource source = new InputSource(is);
// set encoding
source.setEncoding(StandardCharsets.UTF_8.toString());
//source.setEncoding(StandardCharsets.UTF_16.toString());
xmlReader.parse(source);
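An encoding hint matters most when the bytes are not UTF-8 and the file carries no encoding declaration. The following sketch parses ISO-8859-1 bytes with the hint set on the InputSource. It assumes the JDK's built-in parser, which honors the hint for byte streams; other implementations may behave differently, and the hint has no effect on character streams. `EncodingHintDemo` is a hypothetical name.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class EncodingHintDemo {

    // Parses raw bytes using the given encoding hint and returns the text content.
    static String text(byte[] xmlBytes, String encoding) {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            InputSource source = new InputSource(new ByteArrayInputStream(xmlBytes));
            source.setEncoding(encoding); // tell the parser how to decode the bytes
            reader.parse(source);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Unable to parse XML", e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // No XML declaration, and the bytes are ISO-8859-1 rather than UTF-8.
        byte[] latin1 = "<name>caf\u00e9</name>".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(text(latin1, StandardCharsets.ISO_8859_1.name()));
    }
}
```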
Note
More SAX parser examples – Oracle – Simple API for XML (SAX)
6. Download Source Code
$ git clone https://github.com/mkyong/core-java
$ cd core-java/java-xml
$ cd src/main/java/com/mkyong/xml/sax/
7. References
- XML parsers should not be vulnerable to XXE attacks
- Wikipedia – Java API for XML Processing
- Wikipedia – Simple API for XML
- Wikipedia – Document Object Model
- Wikipedia – Observer pattern
- Oracle – Java API for XML Processing (JAXP)
- Oracle – Simple API for XML (SAX)
- Oracle – Document Object Model (DOM)
- How to read XML file in Java – (DOM Parser)
- How to read XML file in Java – (JDOM Parser)
- JAXB hello world example
- How to prevent XML external entity attack (XXE attack)
Can you explain, with an example, the difference between the SAX parser and the DOM parser?
DOM – reads the whole structure into memory, where it stays, so you can then read your data from memory and run many operations on it, such as searches (useful for small files).
SAX – nothing is stored in memory, so you can't go back to the data in later operations. The file is parsed once, and you must capture the data while it is being parsed (useful for large files).
Thanks for your invaluable input. Some extras:
DOM – random access to the XML file.
SAX – uses a callback mechanism and parses the XML file from top to bottom; usually used for large files that need sequential access.
When I tried to run the SAX parser code, I got the following error:
org.xml.sax.SAXParseException: Content is not allowed in prolog.
at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
at org.apache.xerces.impl.XMLScanner.reportFatalError(Unknown Source)
at org.apache.xerces.impl.XMLDocumentScannerImpl$PrologDispatcher.dispatch(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
at javax.xml.parsers.SAXParser.parse(Unknown Source)
at javax.xml.parsers.SAXParser.parse(Unknown Source)
at com.mypack.ReadXMLFileSAX.main(ReadXMLFileSAX.java:99)
This may be caused by invalid XML content. Please read
https://mkyong.com/java/sax-error-content-is-not-allowed-in-prolog/
Or can you share your XML file?
yes!! Works perfectly
I've tried the code, but I have a problem.
The result is:
End Element :firstname
End Element :lastname
End Element :nickname
End Element :salary
End Element :staff
End Element :firstname
End Element :lastname
End Element :nickname
End Element :salary
End Element :staff
End Element :company
Why does this happen?
Thanks
Solution (found at: http://stackoverflow.com/questions/6301678/java-sax-program-doesnt-go-to-startelement-method):
Check the import statement for the Attributes parameter; it should be:
import org.xml.sax.Attributes;
Same problem here too 🙁
Thank you very much! Your tutorials have helped me several times.
I tried to read an XML file (approx. 600 MB) on a computer with 3.2 GB of RAM and got an out-of-memory exception with XOM, VTD-XML, etc. Only this code handled it successfully. Thank you.
Yes, SAX is designed to parse large XML files. Thanks for sharing your experience.