QHtmlParser: writing an HTML parser with your brain switched off

While developing MiTubo I recently felt the need to parse HTML pages: the first problem I wanted to solve was implementing proper RSS feed detection when the user enters a website URL into MiTubo's search box, so that MiTubo would parse the site's HTML, look for feed URLs in the HEAD section, and let the user subscribe to any video feeds found there.
A quick search on the internet did not provide a clear answer: I found a Qt HTML parser in (stalled) development, and a few other C++ or C parsers (among the latter, lexbor is the most inspiring), but all of them seem to take the approach of parsing the HTML file into a DOM tree, while I was hoping to find a lightweight SAX-like parser, pretty much like Python's html.parser.
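To give an idea of what "SAX-like" means here, this is a minimal sketch of feed detection written against Python's html.parser itself. It is not MiTubo's code and says nothing about QHtmlParser's API: the names FeedLinkFinder and find_feeds are made up for illustration, and the set of feed MIME types is my own assumption about what a site is likely to advertise.

```python
from html.parser import HTMLParser

# MIME types commonly used to advertise feeds (an assumption; extend as needed).
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkFinder(HTMLParser):
    """Hypothetical helper: collects feed URLs from <link> tags inside <head>."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; attrs is a list of (name, value) pairs.
        if tag == "head":
            self.in_head = True
        elif tag == "link" and self.in_head:
            attributes = dict(attrs)
            rel = (attributes.get("rel") or "").lower().split()
            mime = (attributes.get("type") or "").lower()
            href = attributes.get("href")
            if "alternate" in rel and mime in FEED_TYPES and href:
                self.feeds.append(href)

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def find_feeds(html):
    """Return the feed URLs advertised in the HEAD of the given HTML text."""
    finder = FeedLinkFinder()
    finder.feed(html)
    finder.close()
    return finder.feeds
```

The parser never builds a tree: it just calls the handlers as tags go by, and the only state kept around is a flag and a list. The URLs are returned as written in the page, so relative ones still need to be resolved against the site's address.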
Anyway, I don't remember how it happened, but at a certain point I found myself looking at the html.parser source code, and I was surprised to see how compact it was (apart, of course, from the long list of character references for the HTML entities!). On closer inspection, it also appeared that the code was not making much use of Python's dynamic typing, so I thought I could try rewriting it as a Qt class. And a few hours later QHtmlParser was born.
As this post's title suggests, the process of rewriting html.parser with Qt was quite straightforward, and the nice thing about it is that I didn't have to spend any time reading the HTML standard or trying to figure out how to implement the parser: I just had to translate Python code into C++ code, and thanks to the nice API of QString (which in many ways resembles Python's, or vice versa) this was not too hard. I even left most of the original code comments untouched, and reused quite a few tests from the test suite.
It was time well spent. :-)
If you think you might need an HTML parser for your Qt application, you are welcome to give it a try. It's not a library, just a set of files that you can import into your project; for the time being I only have a build file for QBS, but I'll happily accept contributions to make it easier to use QHtmlParser with projects built using other build systems. You can see here the changes I made in MiTubo to start using it and detect RSS feeds in a webpage's HEAD.
That's all for now. And in case you missed the link before, you can find QHtmlParser here.