SAX - Invalid byte 1 of 1-byte UTF-8 sequence

Note
Since Java 8, the JDK's built-in Apache Xerces XML parser has improved a lot, and most of the IO classes now default to UTF-8. We can no longer reproduce the error, and the internal Xerces APIs now print ? for unknown characters instead of reporting the error Invalid byte 1 of 1-byte UTF-8 sequence.

Table of contents

1. SAX parser unable to parse UTF-8 XML?
2. Character encoding in XML
3. Character encoding in source code
4. Download Source Code
5. References

1. SAX parser unable to parse UTF-8 XML?

In older JDK versions, the built-in Apache Xerces XML parser reported the error Invalid byte 1 of 1-byte UTF-8 sequence when we tried to parse an XML file that contained UTF-8 characters.

For example, the following XML file contains a special symbol and two Chinese characters.

staff.xml


<?xml version="1.0"?>
<company>
    <staff>
        <firstname>yong</firstname>
        <lastname>木金</lastname>
        <nickname>§</nickname>
        <salary>100000</salary>
    </staff>
</company>

We then use the SAX parser to process the XML file above.


  SAXParserFactory factory = SAXParserFactory.newInstance();

  try {

      SAXParser saxParser = factory.newSAXParser();

      PrintAllHandlerSax handler = new PrintAllHandlerSax();
      saxParser.parse("c://test//staff.xml", handler);

  } catch (ParserConfigurationException | SAXException | IOException e) {
      e.printStackTrace();
  }

Output

Terminal


com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException:
  Invalid byte 1 of 1-byte UTF-8 sequence.

2. Character encoding in XML

If an XML file has no encoding declaration (and no byte order mark), the parser defaults to UTF-8.

staff.xml


<?xml version="1.0"?>
<company>
    <staff>
        <firstname>yong</firstname>
        <lastname>木金</lastname>
        <nickname>§</nickname>
        <salary>100000</salary>
    </staff>
</company>

It is best practice to declare the character encoding explicitly in the XML declaration.

staff.xml


<?xml version="1.0" encoding="utf-8"?>
<company>
    <staff>
        <firstname>yong</firstname>
        <lastname>木金</lastname>
        <nickname>§</nickname>
        <salary>100000</salary>
    </staff>
</company>
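
As a minimal sketch of why the declaration matters, the example below parses an in-memory document whose bytes are ISO-8859-1 and whose declaration says so. The class name `EncodingDeclarationDemo` and the helper `parseText` are hypothetical names used only for illustration; they are not part of the original project.

```java
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original project.
public class EncodingDeclarationDemo {

    // Parses the given bytes and returns all character data found in the document.
    public static String parseText(byte[] xmlBytes) {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new ByteArrayInputStream(xmlBytes), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // \u00A7 is the section sign; in ISO-8859-1 it is the single byte 0xA7.
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>"
                + "<nickname>\u00A7</nickname>";
        byte[] latin1Bytes = xml.getBytes(StandardCharsets.ISO_8859_1);

        // The parser reads the declared encoding and decodes 0xA7 accordingly.
        System.out.println(parseText(latin1Bytes));
    }
}
```

Because the declaration names the encoding that was actually used to write the bytes, the parser does not have to fall back to its UTF-8 default.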

3. Character encoding in source code

An incorrect character encoding causes the Invalid byte 1 of 1-byte UTF-8 sequence error. For example, we read the XML data as UTF-8, but the file is actually stored in a different encoding, such as ISO_8859_1.
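
To see the mismatch at the byte level, here is a small sketch that checks whether a byte array is well-formed UTF-8 by using a strict `CharsetDecoder`. The class name `Utf8Check` and the method `isValidUtf8` are hypothetical names for illustration. The symbol § is the single byte 0xA7 in ISO_8859_1, and that byte on its own is not a valid UTF-8 sequence.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the original project.
public class Utf8Check {

    // Returns true only if the bytes decode as UTF-8 without any malformed input.
    public static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] latin1 = "\u00A7".getBytes(StandardCharsets.ISO_8859_1); // 0xA7
        byte[] utf8 = "\u00A7".getBytes(StandardCharsets.UTF_8);        // 0xC2 0xA7

        System.out.println(isValidUtf8(latin1)); // false
        System.out.println(isValidUtf8(utf8));   // true
    }
}
```

A check like this can help decide whether a file needs a different encoding before it is handed to the parser.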

To fix it, we can set the character encoding on the InputSource that we pass to the SAX parser.

ReadXmlSaxParser.java


package com.mkyong.xml.sax;

import com.mkyong.xml.sax.handler.PrintAllHandlerSax;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class ReadXmlSaxParser {

  //private static final String FILENAME = "src/main/resources/staff.xml";
  private static final String FILENAME = "c://test//staff.xml";

  public static void main(String[] args) {

      SAXParserFactory factory = SAXParserFactory.newInstance();

      try {

          SAXParser saxParser = factory.newSAXParser();

          PrintAllHandlerSax handler = new PrintAllHandlerSax();
          //saxParser.parse(FILENAME, handler);

          XMLReader xmlReader = saxParser.getXMLReader();
          xmlReader.setContentHandler(handler);

          InputSource source = new InputSource(FILENAME);

          // must match the file's actual encoding, e.g. ISO_8859_1 if the file is not UTF-8
          source.setEncoding(StandardCharsets.UTF_8.name());

          xmlReader.parse(source);

      } catch (ParserConfigurationException | SAXException | IOException e) {
          e.printStackTrace();
      }

  }

}
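
To try the same idea without a file on disk, here is a minimal sketch that assumes an in-memory document stored as ISO-8859-1 with no encoding declaration. The class name `SetEncodingDemo` and the helper `parseWithEncoding` are hypothetical; only `InputSource.setEncoding` and the standard JAXP/SAX classes are taken from the listing above. With the JDK's built-in parser, the encoding set on the `InputSource` should be used to decode the byte stream when the document itself does not declare one.

```java
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original project.
public class SetEncodingDemo {

    // Parses the bytes using the given encoding name and returns the character data.
    public static String parseWithEncoding(byte[] xmlBytes, String encoding) {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        try {
            XMLReader xmlReader = SAXParserFactory.newInstance()
                    .newSAXParser().getXMLReader();
            xmlReader.setContentHandler(handler);

            InputSource source = new InputSource(new ByteArrayInputStream(xmlBytes));
            // Tell the parser how the raw bytes are encoded.
            source.setEncoding(encoding);

            xmlReader.parse(source);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // No encoding declaration; the bytes are ISO-8859-1, so 0xA7 is the section sign.
        String xml = "<?xml version=\"1.0\"?><nickname>\u00A7</nickname>";
        byte[] latin1Bytes = xml.getBytes(StandardCharsets.ISO_8859_1);

        System.out.println(
                parseWithEncoding(latin1Bytes, StandardCharsets.ISO_8859_1.name()));
    }
}
```

The sketch passes `Charset.name()`, which returns the canonical charset name, a form that is also acceptable in an XML encoding declaration.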

4. Download Source Code

$ git clone https://github.com/mkyong/core-java

$ cd core-java/java-xml

$ cd src/main/java/com/mkyong/xml/

5. References

5 Comments


Murodhon

12 years ago

Thanks, This helped me resolve my issue to. mkyong.com is great and very useful

Oladeji Oluwasayo

13 years ago

I use your articles a lot in my not-so-JavaEE development. You’ve been an extremely valuable resource!

Shaik Allabakash

Awesome buddy….It worked like magic…Many results came while looking for this in the search engines but most of them suggested to modify the source file. Yours is the only one that provided the right solution.

vishal solankee

Thanks a lot for providing such hands-on!!

Yuchen

I have a question. How to test if a string is a valid UTF-8 in Java? I know how to detect whether byte[] is UTF-8 but I have no idea about how to detect whether a java string is valid UTF-8?

Thanks
