Synopsis
lua-gumbo is an HTML5 parser and DOM library for Lua. It started out as a set of Lua bindings for the Gumbo C library, but has since absorbed an improved fork of it.
Requirements
Installation
To install the latest release via LuaRocks use the command:
luarocks install gumbo
Usage
The gumbo module provides a parse function and a parseFile function, both of which return a Document node containing a tree of descendant nodes. The structure and API of this tree mostly follow the DOM Level 4 Core specification.
For full API documentation, see: https://craigbarnes.gitlab.io/lua-gumbo/.
Example
The following is a simple demonstration of how to find an element by ID and print the contents of its first child text node.
local gumbo = require "gumbo"
local document = gumbo.parse('<div id="foo">Hello World</div>')
local foo = document:getElementById("foo")
local text = foo.childNodes[1].data
print(text) --> Hello World
Note: this example omits error handling for the sake
of simplicity. Production code should wrap each step with assert()
or some other, application-specific error handling.
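As a rough sketch of what the note above suggests, the same example can be written with an assert() around each step. The error message strings are illustrative only and are not produced by the library.

```lua
local gumbo = require "gumbo"

-- gumbo.parse returns nil plus an error message on failure,
-- which assert() turns into a raised error.
local document = assert(gumbo.parse('<div id="foo">Hello World</div>'))

-- Each lookup below may yield nil, so each one is checked in turn.
local foo = assert(document:getElementById("foo"), "no element with id 'foo'")
local firstChild = assert(foo.childNodes[1], "element has no child nodes")
assert(firstChild.nodeName == "#text", "first child is not a Text node")

print(firstChild.data) --> Hello World
```

An application with its own error reporting would likely replace the assert() calls with explicit nil checks.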
See also: https://craigbarnes.gitlab.io/lua-gumbo/#examples.
Parser API
The gumbo module provides 2 functions:
parse
local document = gumbo.parse(html, options)
Parameters:
html
: A string of UTF-8 encoded HTML.
options
: A table of parsing options.
Returns:
Either a Document node on success, or nil and an error message on failure.
parseFile
local document = gumbo.parseFile(pathOrFile, options)
Parameters:
pathOrFile
: Either a file handle or filename string that refers to a file containing UTF-8 encoded HTML.
options
: A table of parsing options.
Returns:
As above.
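The two calling conventions of parseFile can be sketched as follows. The temporary file and its contents are made up for the demonstration; only parseFile and the title property come from the API described here.

```lua
local gumbo = require "gumbo"

-- Write a small HTML file to parse (illustrative content only).
local path = os.tmpname()
local out = assert(io.open(path, "w"))
out:write("<title>Demo</title><p>Hello</p>")
out:close()

-- 1. Pass a filename string and check the nil/error-message pair by hand.
local document, err = gumbo.parseFile(path)
if not document then
    io.stderr:write("parse failed: ", tostring(err), "\n")
    os.exit(1)
end
print(document.title)

-- 2. Pass an already opened file handle instead of a filename.
local handle = assert(io.open(path, "r"))
local fromHandle = assert(gumbo.parseFile(handle))
handle:close()
print(fromHandle.title)

os.remove(path)
```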
Parsing Options
A table of options can be passed as the second argument to both parse functions. The field names and their specified effects are as follows:
tabStop
: The number of columns to count for tab characters when computing source positions (defaults to 8).
contextElement
: A string containing the name of an element to use as context for HTML fragment parsing. This is for fragment parsing only; leave as nil to parse HTML documents.
contextNamespace
: The namespace to use for the contextElement parameter; either "html", "svg" or "math" (defaults to "html").
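A minimal sketch of passing an options table is shown below. The markup is made up, and the exact shape of the tree returned for a fragment is not described in this document, so the example only walks whatever node comes back rather than assuming a particular structure.

```lua
local gumbo = require "gumbo"

-- Count tab characters as 4 columns when computing source positions.
local document = assert(gumbo.parse("<p>\tindented</p>", {tabStop = 4}))

-- Parse a fragment as if it appeared inside a <tr> element.
local fragment = assert(gumbo.parse("<td>cell</td>", {
    contextElement = "tr",
    contextNamespace = "html", -- the default, spelled out for clarity
}))

-- Print the local name of every element found in the fragment result.
for node in fragment:walk() do
    if node.type == "element" then
        print(node.localName)
    end
end
```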
DOM API
The lua-gumbo DOM API mostly follows the DOM Level 4 Core specification, with the following (intentional) exceptions:
- DOMString types are encoded as UTF-8 instead of UTF-16.
- Lists begin at index 1 instead of 0.
- readonly is not fully enforced.
The following sections list the supported properties and methods, grouped by the DOM interface in which they are specified.
Nodes
There are 6 types of Node that may appear directly in the HTML DOM tree. These are Element, Text, Comment, Document, DocumentType and DocumentFragment. The Node type itself is just an interface, which is implemented by all 6 of the aforementioned types.
Element
Implements:
Properties:
localName
- The name of the element, as a case-normalized string (lower case for all HTML elements and most other elements; camelCase for some SVG elements).
attributes
- An AttributeList containing an Attribute object for each attribute of the element.
namespaceURI
- The canonical namespace URI of the element. Possible values:
Namespace   namespaceURI value
HTML        "http://www.w3.org/1999/xhtml"
MathML      "http://www.w3.org/1998/Math/MathML"
SVG         "http://www.w3.org/2000/svg"
tagName
- The name of the element, as an upper case string.
id
- The value of the element’s id attribute, if it has one, otherwise nil.
className
- The value of the element’s class attribute, if it has one, otherwise nil.
innerHTML
- A string containing the HTML serialization of the element’s descendants.
outerHTML
- Like innerHTML, but operating on the element itself, in addition to its descendants.
Methods:
getElementsByTagName(tagName)
- Returns an ElementList containing every descendant Element node whose localName is equal to the given tagName argument.
getElementsByClassName(classNames)
- Returns an ElementList containing every descendant Element node that has all of the given class names. Multiple class names can be specified by passing a string with several names separated by whitespace.
hasAttributes()
- Returns true if the element has 1 or more attributes or false if it has none.
hasAttribute(name)
- Returns true if the element has an attribute whose name matches the given name argument.
getAttribute(name)
- Returns the value of the attribute whose name matches the name argument or nil if no such attribute exists.
setAttribute(name, value)
- Sets the attribute specified with name to the given value. This method can be used either to create a new attribute or change an existing one.
removeAttribute(name)
- Removes the attribute with the specified name from the element.
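The Element properties and attribute methods listed above can be exercised with a short sketch like the following. The markup, id and class names are invented for illustration.

```lua
local gumbo = require "gumbo"

local document = assert(gumbo.parse('<p id="intro" class="lead note">Hi</p>'))
local p = assert(document:getElementById("intro"))

-- localName is case-normalized; tagName is upper case.
print(p.localName, p.tagName)
print(p.className)

-- Create, read and remove attributes.
local hadTitle = p:hasAttribute("title")
p:setAttribute("title", "Greeting")
print(hadTitle, p:getAttribute("title"))
p:removeAttribute("class")
print(p:getAttribute("class")) -- nil once the attribute is gone

-- Class name lookup; several names are separated by whitespace.
local other = assert(gumbo.parse('<i class="a b">x</i><i class="a">y</i>'))
local both = other:getElementsByClassName("a b")
print(both.length)

-- Serialization of the element itself plus its descendants.
print(p.outerHTML)
```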
Text
Implements:
Properties:
data
- A string representing the text contents of the node.
length
- The length of the data property in bytes.
escapedData
- TODO
Comment
Implements:
Properties:
data
- A string representing the text contents of the comment node, not including the start delimiter (<!--) or end delimiter (-->) from the original markup.
length
- The length of the data property in bytes.
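Because length is measured in bytes rather than characters, non-ASCII text gives a larger value than a character count would. The sketch below illustrates this for a Text node and a Comment node; the sample markup is made up, and the "é" is written as a byte escape so the example does not depend on the encoding of the source file.

```lua
local gumbo = require "gumbo"

-- "h\195\169llo" is "héllo" in UTF-8; the "é" occupies two bytes.
local document = assert(gumbo.parse("<p>h\195\169llo</p><!-- note -->"))

local text = document.body.childNodes[1].childNodes[1]
local comment = document.body.childNodes[2]

-- Compare the length property with Lua's own byte count of the data.
print(text.data, text.length, #text.data)

-- The comment data excludes the <!-- and --> delimiters.
print(comment.data, comment.length, #comment.data)
```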
Document
The Document node represents the outermost container of a DOM tree and is the result of parsing a single HTML document. Its direct child nodes may include a single DocumentType node and any number of Element or Comment nodes.
Implements:
Properties:
documentElement
- The root Element of the document (i.e. the <html> element).
head
- The <head> Element of the document.
body
- The <body> Element of the document.
title
- A string containing the document’s title (initially, the text contents of the <title> element in the document markup).
forms
- An ElementList of all <form> elements in the document.
images
- An ElementList of all <img> elements in the document.
links
- An ElementList of all <a> and <area> elements in the document that have a value for the href attribute.
scripts
- An ElementList of all <script> elements in the document.
doctype
- A reference to the document’s DocumentType node, if it has one, or nil if not.
Methods:
getElementById(elementId)
- Returns the first Element node in the tree whose id property is equal to the elementId string.
getElementsByTagName(tagName)
- Returns an ElementList containing every descendant Element node whose localName is equal to the given tagName argument.
getElementsByClassName(classNames)
- Returns an ElementList containing every descendant Element node that has all of the given class names. Multiple class names can be specified by passing a string with several names separated by whitespace.
createElement(tagName)
- Creates and returns a new Element node with the given tag name.
createTextNode(data)
- Creates and returns a new Text node with the given text contents.
createComment(data)
- Creates and returns a new Comment node with the given text contents.
adoptNode(externalNode)
- Removes a node and its subtree from another Document (if any) and changes its ownerDocument to the current document. The node can then be inserted into the current document tree.
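The node creation methods above are typically combined with Node.appendChild to build new content. The following is a small sketch of that pattern; the list markup and labels are invented for the example.

```lua
local gumbo = require "gumbo"

local document = assert(gumbo.parse('<ul id="list"></ul>'))
local list = assert(document:getElementById("list"))

-- Build one <li> per label and attach it to the list.
for _, label in ipairs{"one", "two"} do
    local item = document:createElement("li")
    item:appendChild(document:createTextNode(label))
    list:appendChild(item)
end

-- Comment nodes are created and attached in the same way.
list:appendChild(document:createComment(" end of list "))

print(list.outerHTML)
```

New nodes are created through the Document but are not part of the tree until they are inserted explicitly.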
DocumentType
Implements:
Properties:
name
- TODO
publicId
- TODO
systemId
- TODO
DocumentFragment
Implements:
Properties:
TODO
Node Interfaces
Node
The Node interface is implemented by all DOM nodes.
Properties:
childNodes
- A NodeList containing all the children of the node.
parentNode
- The parent Node of the node, if it has one, otherwise nil.
parentElement
- The parent Element of the node, if it has one, otherwise nil.
ownerDocument
- The Document to which the node belongs.
nodeType
- An integer code representing the type of the node.
Node type          nodeType value   Symbolic constant
Element            1                Node.ELEMENT_NODE
Text               3                Node.TEXT_NODE
Comment            8                Node.COMMENT_NODE
Document           9                Node.DOCUMENT_NODE
DocumentType       10               Node.DOCUMENT_TYPE_NODE
DocumentFragment   11               Node.DOCUMENT_FRAGMENT_NODE
nodeName
- A string representation of the type of the node:
Node type          nodeName value
Element            The value of Element.tagName
Text               "#text"
Comment            "#comment"
Document           "#document"
DocumentType       The value of DocumentType.name
DocumentFragment   "#document-fragment"
type
- A string corresponding to the type of the node, as understood by the original Gumbo C API. This is subtly different from the node types in the DOM specification. The whitespace type is a Text node whose data property consists of only whitespace characters.
type value     Node type
"element"      Element
"text"         Text
"whitespace"   Text
"comment"      Comment
"document"     Document
"doctype"      DocumentType
"fragment"     DocumentFragment
Note: in DOM-specified properties and methods, whitespace and text nodes are treated exactly the same. This disparity may be confusing, but the type property is retained for the sake of backwards compatibility with previous releases. It may also be useful for efficiently skipping whitespace-only Text nodes when traversing a document.
firstChild
- The first child Node of the node, or nil if it has no children.
lastChild
- The last child Node of the node, or nil if it has no children.
previousSibling
- The previous adjacent Node in the tree, or nil if there is none.
nextSibling
- The next adjacent Node in the tree, or nil if there is none.
textContent
- If the node is a Text or Comment node, textContent returns the node text (the data property). If the node is a Document or DocumentType node, textContent always returns nil. For other node types, textContent returns the concatenation of the textContent value of every child node, excluding comments, or an empty string.
insertedByParser
- true if the node was inserted into the DOM tree automatically by the parser and false otherwise.
implicitEndTag
- true if the node was implicitly closed by the parser (e.g. there was no explicit end tag in the markup) and false otherwise.
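To see how nodeName, nodeType and type relate to one another, the following sketch prints all three for the children of a single paragraph. The markup is made up; the values in the comments follow from the tables above.

```lua
local gumbo = require "gumbo"

local document = assert(gumbo.parse("<p>Hello <!-- note --><b>World</b></p>"))
local p = assert(document.body.firstChild)

-- The children are a Text node, a Comment node and an Element node:
--   "#text"     3   "text"
--   "#comment"  8   "comment"
--   "B"         1   "element"
for i, node in ipairs(p.childNodes) do
    print(i, node.nodeName, node.nodeType, node.type)
end

-- textContent concatenates descendant text and skips comments.
print(p.textContent) --> Hello World
```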
Methods:
hasChildNodes()
- Returns true if the node has any child nodes and false otherwise.
contains(other)
- Returns true if other is an inclusive descendant Node and false otherwise.
appendChild(node)
- Adds the Node passed as the node parameter to the end of the childNodes list.
insertBefore(newNode, referenceNode)
- Inserts newNode immediately before referenceNode as a child of the current node. If referenceNode is nil, newNode is inserted at the end of the list of child nodes. The returned value is the inserted node.
replaceChild(newChild, oldChild)
- Removes the oldChild node from the childNodes list and inserts newChild in its place. If newChild is part of an existing DOM tree, it is first removed from its parent. The returned value is the removed node.
removeChild(childNode)
- Removes childNode from the DOM tree and returns the removed node. If the given childNode is not a child of the caller, the method throws an error.
cloneNode(deep)
- Returns a duplicate copy of the node on which the method is called. If deep is true, all descendant nodes are also copied.
walk()
- Returns an iterator function that, each time it is called, returns the next descendant node in the subtree, in document order. This is typically used to iterate over every node in a document via a generic for loop. For example:
local gumbo = require "gumbo"
local document = assert(gumbo.parse("<p>1</p><p>2</p><p>3</p>"))
for node in document.body:walk() do
    print(node)
end
Note: this method is an extension and is not a part of any specification. It is similar in purpose to the Document.createTreeWalker() method, but has a different, much simpler API.
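The tree-mutation methods of the Node interface can be combined as in the sketch below, which reorders a small list. The markup is illustrative, and the example relies only on the behaviour described in the method list above.

```lua
local gumbo = require "gumbo"

local document = assert(gumbo.parse("<ul><li>a</li><li>c</li></ul>"))
local list = assert(document.body.firstElementChild)
local first, last = list.firstElementChild, list.lastElementChild

-- Insert a new <li> between the two existing items.
local middle = document:createElement("li")
middle:appendChild(document:createTextNode("b"))
list:insertBefore(middle, last)

-- Deep-clone the first item; a nil reference node appends at the end.
local copy = first:cloneNode(true)
list:insertBefore(copy, nil)

-- removeChild returns the node that was removed.
local removed = list:removeChild(first)

print(list:contains(middle), list:contains(removed))
print(list.textContent) -- items are now b, c and the copy of a
```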
ParentNode
The ParentNode interface is implemented by all nodes that can have children.
Properties:
children
- An ElementList of child Element nodes.
childElementCount
- An integer representing the number of child Element nodes.
firstElementChild
- The node’s first child Element if there is one, otherwise nil.
lastElementChild
- The node’s last child Element if there is one, otherwise nil.
ChildNode
The ChildNode interface is implemented by all nodes that can have a parent.
Methods:
remove()
- Removes the node from its parent.
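The difference between childNodes and the element-only ParentNode properties, together with ChildNode.remove(), can be sketched as follows. The markup is made up for the example.

```lua
local gumbo = require "gumbo"

local document = assert(gumbo.parse(
    "<div>text<span>a</span><!-- c --><em>b</em></div>"
))
local div = assert(document.body.firstElementChild)

-- childNodes counts every child; the ParentNode properties count
-- only Element children (here: <span> and <em>).
print(div.childNodes.length, div.childElementCount)
print(div.firstElementChild.localName, div.lastElementChild.localName)

-- remove() detaches a node from its parent.
div.firstElementChild:remove()
print(div.childElementCount, div.firstElementChild.localName)
```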
Attribute Objects
AttributeList
An AttributeList is a list containing zero or more Attribute objects. Every Element node has an associated AttributeList, which can be accessed via the Element.attributes property.
Note: AttributeList is equivalent to what the DOM specification calls NamedNodeMap.
Properties:
TODO
Attribute
The Attribute type represents a single attribute of an Element.
Note: Attribute is equivalent to what the DOM specification calls Attr.
Properties:
name
- The name of the attribute.
value
- The value of the attribute.
escapedValue
- The value of the attribute, escaped according to the rules in the HTML fragment serialization algorithm. Ampersand (&) characters in value become &amp;, double quote (") characters become &quot; and non-breaking spaces (U+00A0) become &nbsp;. This property is an extension and is not a part of any specification.
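The following sketch prints the name, value and escapedValue of each attribute on an element. It assumes an AttributeList can be iterated with ipairs in the same way as the other lists in this document; the markup is invented for the example.

```lua
local gumbo = require "gumbo"

local document = assert(gumbo.parse('<a id="x" title="Q&amp;A">link</a>'))
local a = assert(document:getElementById("x"))

-- Collect the attributes by name while printing them.
local seen = {}
for _, attr in ipairs(a.attributes) do
    print(attr.name, attr.value, attr.escapedValue)
    seen[attr.name] = attr
end

-- The parser decodes "&amp;" in the markup to "&" in value;
-- escapedValue re-escapes it for serialization.
print(seen.title.value, seen.title.escapedValue)
```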
Node Containers
NodeList
A list containing zero or more nodes, as typically returned by the Node.childNodes property.
Properties:
length
- The number of nodes contained by the list.
ElementList
A list containing zero or more Element nodes, as typically returned by various Node and ParentNode methods.
Note: ElementList is equivalent to what the DOM specification calls HTMLCollection.
Properties:
length
- The number of Element nodes contained by the list.
Examples
-- Prints a list of all hyperlinks in a document.
local gumbo = require "gumbo"
local document = assert(gumbo.parseFile(arg[1] or io.stdin))
for i, element in ipairs(document.links) do
print(element:getAttribute("href"))
end
-- Prints the document title.
local gumbo = require "gumbo"
local document = assert(gumbo.parseFile(arg[1] or io.stdin))
print(document.title)
-- Removes the element with the given id from the document tree.
local gumbo = require "gumbo"
local id = assert(arg[1], "Error: arg[1] is nil; expected element id")
local document = assert(gumbo.parseFile(arg[2] or io.stdin))
local element = document:getElementById(id)
if element then
    element:remove()
end
document:serialize(io.stdout)
-- Replaces all "align" attributes on td/th elements with a "style"
-- attribute containing the equivalent CSS "text-align" property.
-- This can be used to fix Pandoc HTML output.
local gumbo = require "gumbo"
local Set = require "gumbo.Set"
local document = assert(gumbo.parseFile(arg[1] or io.stdin))
local alignments = Set{"left", "right", "center", "start", "end"}
local function fixAlignAttr(element)
    local align = element:getAttribute("align")
    if align and alignments[align] then
        local style = element:getAttribute("style")
        if style then
            element:setAttribute("style", style .. "; text-align:" .. align)
        else
            element:setAttribute("style", "text-align:" .. align)
        end
        element:removeAttribute("align")
    end
end
for node in document.body:walk() do
    if node.localName == "td" or node.localName == "th" then
        fixAlignAttr(node)
    end
end
document:serialize(io.stdout)
-- Prints the plain text contents of a HTML file, excluding the contents
-- of code elements. This may be useful for filtering out markup from a
-- HTML document before passing it to a spell-checker or other text
-- processing tool.
local gumbo = require "gumbo"
local document = assert(gumbo.parseFile(arg[1] or io.stdin))
local codeElements = assert(document:getElementsByTagName("code"))
for i, element in ipairs(codeElements) do
    element:remove()
end
io.write(document.body.textContent)