NAME
HTML::HTML5::Outline - implementation of the HTML5 Outline algorithm
SYNOPSIS
use JSON;
use HTML::HTML5::Outline;
my $html = <<'HTML';
Hello
World
Good Morning
Vietnam
HTML
my $outline = HTML::HTML5::Outline->new($html);
print to_json($outline->to_hashref, {pretty=>1,canonical=>1});
DESCRIPTION
This is an implementation of the HTML5 Outline algorithm, as per
.
The module can output a JSON-friendly hashref, or an RDF model.
Constructor
* "HTML::HTML5::Outline->new($html, %options)"
Construct a new outline. $html is the HTML to generate an outline
from, either as an HTML or XHTML string, or as an
XML::LibXML::Document object.
Options:
* default_language - default language to assume text is in when no
lang/xml:lang attribute is available. e.g. 'en-gb'.
* element_subjects - rather advanced feature that doesn't bear
explaining. See USE WITH RDF::RDFA::PARSER for an example.
* microformats - support "", ""
and "" as sectioning elements (like
"", "", etc). Boolean, defaults to false.
* parser - 'html' (default) or 'xml' - choose the parser to use
for XHTML/HTML. If the constructor is passed an
XML::LibXML::Document, this is ignored.
* suppress_collections - allows rdf:List stuff to be suppressed
from RDF output. RDF output - especially in Turtle format -
looks somewhat nicer without them, but if you care about the
order of headings and sections, then you'll want them. Boolean,
defaults to false.
* uri - the document URI for resolving relative URI references.
Only really used by the RDF output.
Object Methods
* "to_hashref"
Returns data as a nested hashref/arrayref structure. Dump it as JSON
and you'll figure out the format pretty easily.
* "to_rdf"
Returns data as a n RDF::Trine::Model. Requires RDF::Trine to be
installed. Otherwise this method won't exist.
* "primary_outlinee"
Returns a HTML::HTML5::Outline::Outlinee element representing the
outline for the page.
Class Methods
* "has_rdf"
Indicates whether the "to_rdf" object method exists.
USE WITH RDF::RDFA::PARSER
This module produces RDF data where many of the resources described are
HTML elements. RDFa data typically does not, but RDF::RDFa::Parser does
also support some extensions to RDFa which do (e.g. support for the
"cite" and "role" attributes). It's useful to combine the RDF data from
each, and RDF::RDFa::Parser 1.093 and upwards contains a few shims to
make this possible.
Without further ado...
use HTML::HTML5::Outline;
use RDF::RDFa::Parser 1.093;
use RDF::TrineShortcuts;
my $rdfa = RDF::RDFa::Parser->new(
$html_source,
$base_url,
RDF::RDFa::Parser::Config->new(
'html5', '1.1',
role_attr => 1,
cite_attr => 1,
longdesc_attr => 1,
),
)->consume;
my $outline = HTML::HTML5::Outline->new(
$rdfa->dom,
uri => $rdfa->uri,
element_subjects => $rdfa->element_subjects,
);
# Merging two graphs is pretty complicated in RDF::Trine
# but a little easier with RDF::TrineShortcuts...
my $combined = rdf_parse();
rdf_parse($rdfa->graph, model => $combined);
rdf_parse($outline->to_rdf, model => $combined);
my $NS = {
dc => 'http://purl.org/dc/terms/',
o => 'http://ontologi.es/outline#',
type => 'http://purl.org/dc/dcmitype/',
xs => 'http://www.w3.org/2001/XMLSchema#',
xhv => 'http://www.w3.org/1999/xhtml/vocab#',
};
print rdf_string($combined => 'Turtle', namespaces => $NS);
SEE ALSO
HTML::HTML5::Outline::RDF, HTML::HTML5::Outline::Outlinee,
HTML::HTML5::Outline::Section.
HTML::HTML5::Parser, HTML::HTML5::Sanity.
AUTHOR
Toby Inkster,
ACKNOWLEDGEMENTS
This module is a fork of the document structure parser from Swignition
.
That in turn includes the following credits: thanks to Ryan King and
Geoffrey Sneddon for pointing me towards [the HTML5] algorithm. I also
used Geoffrey's python implementation as a crib sheet to help me figure
out what was supposed to happen when the HTML5 spec was ambiguous.
COPYRIGHT AND LICENCE
Copyright (C) 2008-2011 by Toby Inkster
This library is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.