
SAX – Invalid byte 1 of 1-byte UTF-8 sequence

Note
Since Java 8, the Apache Xerces XML parser built into the JDK has improved considerably, and most of the IO classes now default to UTF-8. We can no longer reproduce this error; the internal Xerces APIs now print ? for unknown characters instead of reporting Invalid byte 1 of 1-byte UTF-8 sequence.


1. SAX parser unable to parse UTF-8 XML?

In older JDK versions, the built-in Apache Xerces XML parser throws the error Invalid byte 1 of 1-byte UTF-8 sequence when we try to parse an XML file that contains UTF-8 characters.

For example, the following XML file contains a special symbol (§) and two Chinese characters.

staff.xml

<?xml version="1.0"?>
<company>
    <staff>
        <firstname>yong</firstname>
        <lastname>木金</lastname>
        <nickname>§</nickname>
        <salary>100000</salary>
    </staff>
</company>

We then use the SAX parser to process this XML file.


  SAXParserFactory factory = SAXParserFactory.newInstance();

  try {

      SAXParser saxParser = factory.newSAXParser();

      PrintAllHandlerSax handler = new PrintAllHandlerSax();
      saxParser.parse("c://test//staff.xml", handler);

  } catch (ParserConfigurationException | SAXException | IOException e) {
      e.printStackTrace();
  }

Output

Terminal

com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException:
  Invalid byte 1 of 1-byte UTF-8 sequence.

2. Character encoding in XML

If an XML file does not declare an encoding (and has no byte order mark), the parser defaults to UTF-8.

staff.xml

<?xml version="1.0"?>
<company>
    <staff>
        <firstname>yong</firstname>
        <lastname>木金</lastname>
        <nickname>§</nickname>
        <salary>100000</salary>
    </staff>
</company>

It’s best practice to define a character encoding for an XML file.

staff.xml

<?xml version="1.0" encoding="utf-8"?>
<company>
    <staff>
        <firstname>yong</firstname>
        <lastname>木金</lastname>
        <nickname>§</nickname>
        <salary>100000</salary>
    </staff>
</company>
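
The declaration only helps when it matches the bytes actually stored. As a minimal sketch (not part of the original tutorial), the example below builds a small XML document in memory, encodes it as ISO-8859-1, declares that encoding in the prolog, and lets the SAX parser pick the decoder from the declaration. The class `DeclaredEncodingDemo` and its helpers `parseText` and `latin1Xml` are hypothetical names used for illustration only.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

class DeclaredEncodingDemo {

  // Hypothetical helper: parses raw XML bytes and returns all character data.
  // The parser chooses the decoder from the encoding declaration in the bytes.
  static String parseText(byte[] xmlBytes) {
      StringBuilder text = new StringBuilder();
      try {
          SAXParser saxParser = SAXParserFactory.newInstance().newSAXParser();
          saxParser.parse(new ByteArrayInputStream(xmlBytes), new DefaultHandler() {
              @Override
              public void characters(char[] ch, int start, int length) {
                  text.append(ch, start, length);
              }
          });
      } catch (Exception e) {
          throw new IllegalStateException("Unable to parse XML", e);
      }
      return text.toString();
  }

  // Hypothetical sample: the declared encoding matches the bytes we produce.
  static byte[] latin1Xml() {
      String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>"
              + "<staff><nickname>\u00a7</nickname></staff>"; // section sign
      return xml.getBytes(StandardCharsets.ISO_8859_1);
  }

  public static void main(String[] args) {
      // Prints the nickname; how it renders depends on the console encoding.
      System.out.println(parseText(latin1Xml()));
  }
}
```

The non-ASCII character is written as a Unicode escape so that the example does not depend on the encoding of the Java source file itself.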

3. Character encoding in source code

An incorrect character encoding causes the Invalid byte 1 of 1-byte UTF-8 sequence error. For example, we read the XML data as UTF-8, but the file is actually stored in a different encoding, such as ISO_8859_1.
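
To see why such a mismatch fails, here is a small sketch (not from the original article) that checks whether a byte array is well-formed UTF-8 using a strict `CharsetDecoder`. The class `Utf8Check` and the method `isValidUtf8` are hypothetical names. The § symbol from staff.xml is a single byte 0xA7 in ISO-8859-1, which is not a legal first byte in UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Utf8Check {

  // Returns true only if the bytes form a well-formed UTF-8 sequence.
  static boolean isValidUtf8(byte[] bytes) {
      CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
              .onMalformedInput(CodingErrorAction.REPORT)
              .onUnmappableCharacter(CodingErrorAction.REPORT);
      try {
          decoder.decode(ByteBuffer.wrap(bytes));
          return true;
      } catch (CharacterCodingException e) {
          return false;
      }
  }

  public static void main(String[] args) {
      String nickname = "\u00a7"; // the section sign used in staff.xml
      byte[] latin1 = nickname.getBytes(StandardCharsets.ISO_8859_1); // one byte: 0xA7
      byte[] utf8 = nickname.getBytes(StandardCharsets.UTF_8);        // two bytes: 0xC2 0xA7
      System.out.println(isValidUtf8(latin1)); // false
      System.out.println(isValidUtf8(utf8));   // true
  }
}
```

A check like this can help confirm which encoding a file really uses before deciding what to pass to the parser.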

To fix the error, we can set the character encoding explicitly on the InputSource that we pass to the SAX parser.

ReadXmlSaxParser.java

package com.mkyong.xml.sax;

import com.mkyong.xml.sax.handler.PrintAllHandlerSax;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class ReadXmlSaxParser {

  //private static final String FILENAME = "src/main/resources/staff.xml";
  private static final String FILENAME = "c://test//staff.xml";

  public static void main(String[] args) {

      SAXParserFactory factory = SAXParserFactory.newInstance();

      try {

          SAXParser saxParser = factory.newSAXParser();

          PrintAllHandlerSax handler = new PrintAllHandlerSax();
          //saxParser.parse(FILENAME, handler);

          XMLReader xmlReader = saxParser.getXMLReader();
          xmlReader.setContentHandler(handler);

          InputSource source = new InputSource(FILENAME);

          // set the encoding the file is actually stored in,
          // e.g. StandardCharsets.ISO_8859_1.name() for an ISO-8859-1 file
          source.setEncoding(StandardCharsets.UTF_8.name());

          xmlReader.parse(source);

      } catch (ParserConfigurationException | SAXException | IOException e) {
          e.printStackTrace();
      }

  }

}
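
As an alternative sketch, assuming we already know the real encoding of the data, we can decode the bytes ourselves with an `InputStreamReader` and hand the resulting character stream to the parser through `InputSource(Reader)`. The class `ReaderEncodingDemo` and its helpers below are hypothetical and use in-memory bytes instead of a file so the example is self-contained.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class ReaderEncodingDemo {

  // Hypothetical helper: decodes the bytes with the given charset first,
  // then lets the SAX parser read characters instead of raw bytes.
  static String parseWithCharset(byte[] xmlBytes, Charset charset) {
      StringBuilder text = new StringBuilder();
      try (Reader reader = new InputStreamReader(new ByteArrayInputStream(xmlBytes), charset)) {
          XMLReader xmlReader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
          xmlReader.setContentHandler(new DefaultHandler() {
              @Override
              public void characters(char[] ch, int start, int length) {
                  text.append(ch, start, length);
              }
          });
          xmlReader.parse(new InputSource(reader));
      } catch (Exception e) {
          throw new IllegalStateException("Unable to parse XML", e);
      }
      return text.toString();
  }

  // Hypothetical sample: ISO-8859-1 bytes with no encoding declaration.
  static byte[] undeclaredLatin1Xml() {
      String xml = "<?xml version=\"1.0\"?><staff><nickname>\u00a7</nickname></staff>";
      return xml.getBytes(StandardCharsets.ISO_8859_1);
  }

  public static void main(String[] args) {
      // Prints the nickname; how it renders depends on the console encoding.
      System.out.println(parseWithCharset(undeclaredLatin1Xml(), StandardCharsets.ISO_8859_1));
  }
}
```

Because the parser receives characters rather than bytes, it does not need to guess the encoding, which is useful when the XML file itself cannot be modified.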

4. Download Source Code

$ git clone https://github.com/mkyong/core-java

$ cd java-xml

$ cd src/main/java/com/mkyong/xml/
