Control characters that lxml rejects, of which there are many. Most notably '\x0c' (form feed), which Google Docs emits as a page break. See a similar bug across the git pond: https://github.com/html5lib/html5lib-python/issues/96
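
A minimal sketch of how such characters might be stripped before text reaches lxml. The names `_CONTROL_CHARS` and `strip_control_chars` are hypothetical, and the exact character set is an assumption: it covers the C0 control range except tab, newline, and carriage return, which XML 1.0 permits. The original only names '\x0c' explicitly.

```python
import re

# Assumption: C0 controls other than \t (\x09), \n (\x0a) and \r (\x0d).
# XML 1.0 allows those three; the rest of the range, including the
# form feed '\x0c', is not valid in an XML document.
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def strip_control_chars(text: str) -> str:
    """Return text with the control characters matched above removed."""
    return _CONTROL_CHARS.sub("", text)


if __name__ == "__main__":
    # A form feed between two words, as a page break might appear.
    print(repr(strip_control_chars("page one\x0cpage two")))
```

Removing the characters outright joins the surrounding text; replacing them with a space or newline may be the better choice when a page break should still separate words.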