PI: importhtml

The purpose of an importhtml PI is to produce a well-formed HTML fragment and replace the PI with this fragment. WXT may attempt a simple tidying with jsoup [17] if parsing of the HTML material fails. Note, however, that even if the tidy job is successful, the result may not be exactly what you expect.

There are basically two forms of HTML import, depending on the parameter remote; see below.

You should use importhtml instead of importxml if you want to apply a cssselector, although it is possible to use an xpath even in importhtml if you do not set the remote parameter. The CSS selection is also done with jsoup [17].

<?_wxt importhtml cssselector="" location=""?>

The parameters are:

location (mandatory, but optional in templates) The URI of the file we want to import from. WXT will attempt to tidy the source if necessary.
In templates the location parameter is usually omitted. In this case all content files owned by the module in the script are searched for appropriate content, unless you narrow the search with the parameter id, see below.
encoding (optional) You can specify the expected encoding if you expect the import to lack an XML header stating the encoding. Default encoding is UTF-8 if not set otherwise in the script (option: default-encoding).
id (optional) An id that matches the id of the actual xmlimport in the script. It only has meaning when this element has no location. One reason to use an id is processing time, if a module has many content files. Another reason may be that you have similar structures in different content files and you want to be selective.
keeplinks (optional) May be yes or no. If yes, WXT will attempt to recalculate all links in the imported HTML. If no, all links are removed. Default is yes.
keepstyles (optional) May be yes or no. If no, WXT will simply remove all class and style attributes from HTML tags. Default is yes.
xpath (optional) Any xpath expression that identifies a nodeset that will be treated like an XML fragment.
cssselector (optional) A cssselector that identifies a nodeset that will be treated like an HTML fragment. If neither xpath nor cssselector is set, the selection defaults to cssselector="body p".
If both are set, xpath is used, unless we specify remote (see below).
remote (optional) Should be used if we try to import from a web page over which we have no control (for instance a Wikipedia page).
If we use remote, xpath is ignored.
usecopy (optional) May be yes or no, and is only effective for remote sources. Default is yes.
If no, WXT will access the material at the original source, produce a local copy, and then fetch the result from this copy. This parameter is overridden by the global option use-copy.
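
The interplay between xpath, cssselector and remote described above can be summarized in a small sketch. This is not WXT's own code, only an illustration of the stated rules in Python. The function name choose_selector and the constant DEFAULT_CSS are hypothetical. Two points are assumptions and not stated in the text: an empty value is treated as "not set", and a remote import with only xpath set falls back to the default selector.

```python
from typing import Optional, Tuple

# Default selection named in the parameter list.
DEFAULT_CSS = "body p"


def choose_selector(xpath: Optional[str] = None,
                    cssselector: Optional[str] = None,
                    remote: bool = False) -> Tuple[str, str]:
    """Return (kind, expression) following the rules in the parameter list.

    Hypothetical helper; illustrates the documented precedence only.
    """
    # With remote set, xpath is ignored, so only cssselector can apply.
    # Assumption: if cssselector is missing here, the default is used.
    if remote:
        return ("css", cssselector or DEFAULT_CSS)
    # Without remote, xpath is used whenever it is set,
    # even if cssselector is set as well.
    if xpath:
        return ("xpath", xpath)
    # Otherwise use cssselector, or the default when neither is set.
    return ("css", cssselector or DEFAULT_CSS)
```

For instance, choose_selector(xpath="//h1", cssselector="article") selects by xpath, while the same call with remote=True selects by the cssselector.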

Examples:

<?_wxt importhtml xpath="//h1"?>
<?_wxt importhtml location="http://no.wikipedia.org/wiki/Halden" 
             cssselector="body > p" remote="yes" keeplinks="no"?>
<?_wxt importhtml 
        location="http://www.ia.hiof.no/~borres/ml/index.shtml" 
        xpath="//div[@class='main']/*"?>
<?_wxt importhtml location="../quotes/q.xml" 
        xpath="//p[@class='quot']/*"?>
    
<?_wxt importhtml location="C:\\web\\dw\\index.html" 
       cssselector="article"?>
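
All the examples above share the same shape: the target _wxt, the PI name, and a list of name="value" pairs that may span several lines. As a rough illustration of that shape, the following Python sketch splits such a PI into its name and parameters. It is not WXT's parser; the function name parse_wxt_pi and both regular expressions are hypothetical, and the sketch assumes that values are always enclosed in double quotes, as in the examples.

```python
import re
from typing import Dict, Tuple

# Matches one "<?_wxt name ... ?>" instruction, possibly spanning lines.
PI_RE = re.compile(r"<\?_wxt\s+(\w+)(.*?)\?>", re.DOTALL)
# Matches name="value" pairs; values may contain single quotes and ">".
ATTR_RE = re.compile(r'([\w-]+)\s*=\s*"([^"]*)"')


def parse_wxt_pi(text: str) -> Tuple[str, Dict[str, str]]:
    """Return (pi_name, parameters) for the first _wxt PI found in text."""
    match = PI_RE.search(text)
    if match is None:
        raise ValueError("no _wxt processing instruction found")
    name, rest = match.group(1), match.group(2)
    # Collect the double-quoted parameters into a dictionary.
    return name, dict(ATTR_RE.findall(rest))
```

Applied to the Wikipedia example above, this yields the name importhtml and the four parameters location, cssselector, remote and keeplinks.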