What is the best way to parse HTML?

Hi, I'm trying to parse HTML. When I pass the HTML as a string to Titanium.XML.parseString(), it crashes. I tried something like this:

http.send(); // http is a synchronous http client
var result = http.responseText;
var dom = Titanium.XML.parseString(result); // crash!!

The error looks like this:

[ERROR] Error Domain=com.google.GDataXML Code=-1 "The operation couldn’t be completed. (com.google.GDataXML error -1.)". in -[TiDOMDocumentProxy parseString:] (TiDOMDocumentProxy.m:48)

Am I doing something wrong, or can Titanium.XML.parseString simply not parse HTML? If it can't, is there another way to parse HTML? I need something like getElementById, getElementsByClassName, and so on.

3 Answers

node-htmlparser

node-soupselect

Modify these two libraries to run under Titanium. Together they let you parse HTML that is not well-formed XML.

What I did was this:

I modified htmlparser to expose its exports as a regular object and used Ti.include to "include the file as if it was written there".

I did the same for soupselect, and they worked well together and passed the unit tests :)

Essentially, I added this to the top of the source files:

exports = {};

and this at the bottom of htmlparser.js (soupselect.js needs soupselect = exports; instead):

htmlparser = exports;

With soupselect, I also had to replace the line:

var domUtils = require('htmlparser').DomUtils;

with:

var domUtils = htmlparser.DomUtils;

Usage then looks like this:

Ti.include('htmlparser.js');
Ti.include('soupselect.js');
 
var select = soupselect.select;
 
var body = '<html><head><title>Test</title></head>'
+ '<body>'
+ '<img src="http://cdn.cad-comic.com/comics/2859286598c11964un2ya69354216.jpg" />'
+ '</body></html>';
 
var handler = new htmlparser.DefaultHandler(function(err, dom) {
  if (err) {
    alert('Error: ' + err);
  } else {
    var images = select(dom, 'img');
 
    // Use distinct names so the callback parameter does not shadow the list.
    images.forEach(function(image) {
      alert('src: ' + image.attribs.src);
    });
  }
});
 
var parser = new htmlparser.Parser(handler);
parser.parseComplete(body);

— answered 2 years ago by Robin Duckett
1 Comment
  • This is the way to go; I can confirm it works with Titanium 1.2.2.

    One thing to remember: the soupselect file needs to end with:

    soupselect = exports;

    not htmlparser = exports;

    Additionally, soupselect currently doesn't install correctly for me using npm, so I just downloaded it from GitHub.

    Also beware that Ti.include needs a path; in my case:

    Ti.include('lib/htmlparser/lib/htmlparser.js');
    Ti.include('lib/soupselect/lib/soupselect.js');

    Unfortunately, there are a lot of warnings when including both libs. It works like a charm, however, and it is even fast.

    — commented 1 year ago by Cezary Krzyzanowski

YQL is the best way to parse HTML, as long as the web page does not block it.

In the end, I implemented a parsing procedure that works directly on the HTML string.
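
The answer does not include that procedure, so the following is only a rough sketch of what a string-based approach might look like, not the answerer's actual code. The `extractAttribute` function and its parameters are hypothetical names. It assumes plain tag and attribute names and quoted attribute values, and it will break on HTML where `>` appears inside an attribute value.

```javascript
// Hypothetical helper: collect the values of one attribute from every
// occurrence of a tag, using regular expressions on the raw HTML string.
function extractAttribute(html, tagName, attrName) {
  // Matches each opening tag, e.g. <img ...>, case-insensitively.
  var tagPattern = new RegExp('<' + tagName + '\\b[^>]*>', 'gi');
  // Matches attr="value" or attr='value' inside a single tag.
  var attrPattern = new RegExp(
    '\\s' + attrName + '\\s*=\\s*(?:"([^"]*)"|\'([^\']*)\')', 'i');

  var values = [];
  var tags = html.match(tagPattern) || [];
  for (var i = 0; i < tags.length; i++) {
    var match = attrPattern.exec(tags[i]);
    if (match) {
      // Group 1 holds a double-quoted value, group 2 a single-quoted one.
      values.push(match[1] !== undefined ? match[1] : match[2]);
    }
  }
  return values;
}

var page = '<html><body><img src="a.jpg" /><IMG alt="x" src=\'b.png\'></body></html>';
var sources = extractAttribute(page, 'img', 'src');
// sources is ['a.jpg', 'b.png']
```

This is enough for pulling a few known attributes out of a page, but it offers nothing like getElementById or selector matching; for that, a real parser is the safer choice.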

— answered 2 years ago by Hoseong Hwang
3 Comments
  • Can you help me? I have the same problem...

    — commented 2 years ago by matteo annibali

  • Hi, I have the same problem with parsing remote HTML. How did you solve it? Can you share your parsing procedure with us? Thank you in advance.

    — commented 2 years ago by Antonio Calanducci

  • I would also love to get more detail on this.

    — commented 1 year ago by nick c
