Synopsis
lua-gumbo is an HTML5 parser and DOM library for Lua. It started out as a set of Lua bindings for the Gumbo C library, but has since absorbed an improved fork of it.
Requirements
Installation
To install the latest release via LuaRocks use the command:
luarocks install gumbo
Usage
The gumbo module provides a parse function and a parseFile function, both of which return a Document node containing a tree of descendant nodes. The structure and API of this tree mostly follow the DOM Level 4 Core specification.
For full API documentation, see: https://craigbarnes.gitlab.io/lua-gumbo/.
Example
The following is a simple demonstration of how to find an element by ID and print the contents of its first child text node.
local gumbo = require "gumbo"
local document = gumbo.parse('<div id="foo">Hello World</div>')
local foo = document:getElementById("foo")
local text = foo.childNodes[1].data
print(text) --> Hello World

Note: this example omits error handling for the sake of simplicity. Production code should wrap each step with assert() or some other, application-specific error handling.
See also: https://craigbarnes.gitlab.io/lua-gumbo/#examples.
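The note above can be illustrated with a minimal sketch that checks each step explicitly instead of assuming success. The helper name firstTextById is hypothetical and not part of lua-gumbo; it relies only on the documented behaviour that parse returns nil and an error message on failure.

```lua
local gumbo = require "gumbo"

-- Hypothetical helper (not part of lua-gumbo): returns the text of the
-- first child of the element with the given id, or nil plus a message.
local function firstTextById(html, id)
    local document, err = gumbo.parse(html)
    if not document then
        return nil, "parse failed: " .. tostring(err)
    end
    local element = document:getElementById(id)
    if not element then
        return nil, "no element with id '" .. id .. "'"
    end
    local child = element.childNodes[1]
    if not child or child.nodeName ~= "#text" then
        return nil, "first child is not a text node"
    end
    return child.data
end

local html = '<div id="foo">Hello World</div>'
print(firstTextById(html, "foo"))
print(firstTextById(html, "bar"))
```

Returning nil plus a message mirrors the convention used by parse itself, so callers can wrap the helper in assert() or handle the failure in their own way.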
Parser API
The gumbo module provides 2 functions:
parse
Parameters:
html: A string of UTF-8 encoded HTML.
options: A table of parsing options.
Returns:
Either a Document node on success, or nil and an error message on failure.
parseFile
Parameters:
pathOrFile: Either a file handle or filename string that refers to a file containing UTF-8 encoded HTML.
options: A table of parsing options.
Returns:
As above.
Parsing Options
A table of options can be passed as the second argument to both parse functions. The field names and their specified effects are as follows:
tabStop: The number of columns to count for tab characters when computing source positions (defaults to 8).
contextElement: A string containing the name of an element to use as context for HTML fragment parsing. This is for fragment parsing only; leave it as nil to parse HTML documents.
contextNamespace: The namespace to use for the contextElement parameter; either "html", "svg" or "math" (defaults to "html").
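As a rough sketch of how the options table might be passed, the following parses one complete document and one fragment. The input strings are made up for illustration, and the shape of the fragment result is not described here, so the example only prints its nodeName rather than assuming anything about its children.

```lua
local gumbo = require "gumbo"

-- Parse a complete document, counting each tab as 4 columns when
-- computing source positions.
local document = assert(gumbo.parse("<p>\tHello</p>", {tabStop = 4}))
local firstName = document.body.firstElementChild.localName
print(firstName)

-- Parse a fragment as if it appeared inside an HTML <tr> element.
local fragment, err = gumbo.parse("<td>cell</td>", {
    contextElement = "tr",
    contextNamespace = "html",
})
if fragment then
    print(fragment.nodeName)
else
    print("fragment parsing failed: " .. tostring(err))
end
```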
DOM API
The lua-gumbo DOM API mostly follows the DOM Level 4 Core specification, with the following (intentional) exceptions:
- DOMString types are encoded as UTF-8 instead of UTF-16.
- Lists begin at index 1 instead of 0.
- readonly is not fully enforced.
The following sections list the supported properties and methods, grouped by the DOM interface in which they are specified.
Nodes
There are 6 types of Node that may appear directly in the HTML DOM tree. These are Element, Text, Comment, Document, DocumentType and DocumentFragment. The Node type itself is just an interface, which is implemented by all 6 of the aforementioned types.
Element
Implements:
Properties:
localName - The name of the element, as a case-normalized string (lower case for all HTML elements and most other elements; camelCase for some SVG elements).
attributes - An AttributeList containing an Attribute object for each attribute of the element.
namespaceURI - The canonical namespace URI of the element. Possible values:

    Namespace   namespaceURI value
    HTML        "http://www.w3.org/1999/xhtml"
    MathML      "http://www.w3.org/1998/Math/MathML"
    SVG         "http://www.w3.org/2000/svg"

tagName - The name of the element, as an upper case string.
id - The value of the element’s id attribute, if it has one, otherwise nil.
className - The value of the element’s class attribute, if it has one, otherwise nil.
innerHTML - A string containing the HTML serialization of the element’s descendants.
outerHTML - Like innerHTML, but operating on the element itself, in addition to its descendants.
Methods:
getElementsByTagName(tagName) - Returns an ElementList containing every descendant Element node whose localName is equal to the given tagName argument.
getElementsByClassName(classNames) - Returns an ElementList containing every descendant Element node that has all of the given class names. Multiple class names can be specified by passing a string with several names separated by whitespace.
hasAttributes() - Returns true if the element has 1 or more attributes or false if it has none.
hasAttribute(name) - Returns true if the element has an attribute whose name matches the given name argument.
getAttribute(name) - Returns the value of the attribute whose name matches the name argument or nil if no such attribute exists.
setAttribute(name, value) - Sets the attribute specified with name to the given value. This method can be used either to create a new attribute or change an existing one.
removeAttribute(name) - Removes the attribute with the specified name from the element.
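The attribute methods above can be exercised together in a short sketch. The markup and the "lang" attribute are arbitrary choices for illustration; only properties and methods listed in this section are used.

```lua
local gumbo = require "gumbo"

local document = assert(gumbo.parse('<p id="intro" class="lead">Hi</p>'))
local p = assert(document:getElementById("intro"))

-- localName is case-normalized, tagName is upper case.
print(p.localName, p.tagName)
print(p.className)

-- Create a new attribute, then read it back.
p:setAttribute("lang", "en")
print(p:hasAttribute("lang"), p:getAttribute("lang"))

-- Remove an existing attribute; getAttribute then returns nil.
p:removeAttribute("class")
print(p:getAttribute("class"))

-- Serialize the element, including its own start and end tags.
print(p.outerHTML)
```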
Text
Implements:
Properties:
data - A string representing the text contents of the node.
length - The length of the data property in bytes.
escapedData - TODO
Comment
Implements:
Properties:
data - A string representing the text contents of the comment node, not including the start delimiter (<!--) or end delimiter (-->) from the original markup.
length - The length of the data property in bytes.
Document
The Document node represents the outermost container of a DOM tree and is the result of parsing a single HTML document. Its direct child nodes may include a single DocumentType node and any number of Element or Comment nodes.
Implements:
Properties:
documentElement - The root Element of the document (i.e. the <html> element).
head - The <head> Element of the document.
body - The <body> Element of the document.
title - A string containing the document’s title (initially, the text contents of the <title> element in the document markup).
forms - An ElementList of all <form> elements in the document.
images - An ElementList of all <img> elements in the document.
links - An ElementList of all <a> and <area> elements in the document that have a value for the href attribute.
scripts - An ElementList of all <script> elements in the document.
doctype - A reference to the document’s DocumentType node, if it has one, or nil if not.
Methods:
getElementById(elementId) - Returns the first Element node in the tree whose id property is equal to the elementId string.
getElementsByTagName(tagName) - Returns an ElementList containing every descendant Element node whose localName is equal to the given tagName argument.
getElementsByClassName(classNames) - Returns an ElementList containing every descendant Element node that has all of the given class names. Multiple class names can be specified by passing a string with several names separated by whitespace.
createElement(tagName) - Creates and returns a new Element node with the given tag name.
createTextNode(data) - Creates and returns a new Text node with the given text contents.
createComment(data) - Creates and returns a new Comment node with the given text contents.
adoptNode(externalNode) - Removes a node and its subtree from another Document (if any) and changes its ownerDocument to the current document. The node can then be inserted into the current document tree.
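The creation methods are usually combined with Node.appendChild to extend a parsed tree. The following is a small sketch with made-up markup; it uses only the properties and methods documented in this file.

```lua
local gumbo = require "gumbo"

local document = assert(gumbo.parse("<title>Demo</title><p>First</p>"))

-- Build a new <p> element containing a text node, then attach it to
-- the end of the document body.
local p = document:createElement("p")
p:appendChild(document:createTextNode("Second"))
document.body:appendChild(p)

local paragraphs = document:getElementsByTagName("p")
print(document.title)
print(paragraphs.length)
print(document.body.lastChild.textContent)
```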
DocumentType
Implements:
Properties:
name - TODO
publicId - TODO
systemId - TODO
DocumentFragment
Implements:
Properties:
TODO
Node Interfaces
Node
The Node interface is implemented by all DOM nodes.
Properties:
childNodes - A NodeList containing all the children of the node.
parentNode - The parent Node of the node, if it has one, otherwise nil.
parentElement - The parent Element of the node, if it has one, otherwise nil.
ownerDocument - The Document to which the node belongs.
nodeType - An integer code representing the type of the node.

    Node type          nodeType value   Symbolic constant
    Element            1                Node.ELEMENT_NODE
    Text               3                Node.TEXT_NODE
    Comment            8                Node.COMMENT_NODE
    Document           9                Node.DOCUMENT_NODE
    DocumentType       10               Node.DOCUMENT_TYPE_NODE
    DocumentFragment   11               Node.DOCUMENT_FRAGMENT_NODE

nodeName - A string representation of the type of the node:

    Node type          nodeName value
    Element            The value of Element.tagName
    Text               "#text"
    Comment            "#comment"
    Document           "#document"
    DocumentType       The value of DocumentType.name
    DocumentFragment   "#document-fragment"

type - A string corresponding to the type of the node, as understood by the original Gumbo C API. This is subtly different from the node types in the DOM specification. The whitespace type is a Text node whose data property consists of only whitespace characters.

    type value     Node type
    "element"      Element
    "text"         Text
    "whitespace"   Text
    "comment"      Comment
    "document"     Document
    "doctype"      DocumentType
    "fragment"     DocumentFragment

Note: in DOM-specified properties and methods, whitespace and text nodes are treated exactly the same. This disparity may be confusing, but the type property is retained for the sake of backwards compatibility with previous releases. It may also be useful for efficiently skipping whitespace-only Text nodes when traversing a document.
firstChild - The first child Node of the node, or nil if it has no children.
lastChild - The last child Node of the node, or nil if it has no children.
previousSibling - The previous adjacent Node in the tree, or nil.
nextSibling - The next adjacent Node in the tree, or nil.
textContent - If the node is a Text or Comment node, textContent returns the node text (the data property). If the node is a Document or DocumentType node, textContent always returns nil. For other node types, textContent returns the concatenation of the textContent value of every child node, excluding comments, or an empty string if there are none.
insertedByParser - true if the node was inserted into the DOM tree automatically by the parser and false otherwise.
implicitEndTag - true if the node was implicitly closed by the parser (e.g. there was no explicit end tag in the markup) and false otherwise.
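The difference between nodeType and type is easiest to see on markup that contains formatting whitespace. This sketch, using a made-up list, skips whitespace-only Text nodes via the type property and collects the text of the remaining children; it assumes childNodes can be traversed with ipairs, as the ElementList examples in this document do.

```lua
local gumbo = require "gumbo"

local html = "<ul>\n<li>One</li>\n<li>Two</li>\n</ul>"
local document = assert(gumbo.parse(html))
local list = document.body.firstChild

-- The newlines between the <li> elements are Text nodes (nodeType 3)
-- whose type is "whitespace"; everything else is kept.
local items = {}
for i, node in ipairs(list.childNodes) do
    if node.type ~= "whitespace" then
        print(node.nodeName, node.nodeType, node.type)
        items[#items + 1] = node.textContent
    end
end

print(table.concat(items, ", "))
```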
Methods:
hasChildNodes() - Returns true if the node has any child nodes and false otherwise.
contains(other) - Returns true if other is an inclusive descendant Node and false otherwise.
appendChild(node) - Adds the Node passed as the node parameter to the end of the childNodes list.
insertBefore(newNode, referenceNode) - Inserts newNode immediately before referenceNode as a child of the current node. If referenceNode is nil, newNode is inserted at the end of the list of child nodes. The returned value is the inserted node.
replaceChild(newChild, oldChild) - Removes the oldChild node from the childNodes list and inserts newChild in its place. If newChild is part of an existing DOM tree, it is first removed from its parent. The returned value is the removed node.
removeChild(childNode) - Removes childNode from the DOM tree and returns the removed node. If the given childNode is not a child of the caller, the method throws an error.
cloneNode(deep) - Returns a copy of the node on which the method is called. If deep is true, all descendant nodes are also copied.
walk() - Returns an iterator function that, each time it is called, returns the next descendant node in the subtree, in document order. This is typically used to iterate over every node in a document via a generic for loop. For example:

    local gumbo = require "gumbo"
    local document = assert(gumbo.parse("<p>1</p><p>2</p><p>3</p>"))
    for node in document.body:walk() do
        print(node)
    end

Note: this method is an extension; not a part of any specification. It is similar in purpose to the Document.createTreeWalker() method, but has a different, much simpler API.
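The tree-editing methods can be combined as in the following sketch, which prepends one list item and removes another. The markup is invented for the example, and the comments describe what the method descriptions above imply rather than verified output.

```lua
local gumbo = require "gumbo"

local document = assert(gumbo.parse("<ul><li>B</li><li>C</li></ul>"))
local list = document.body.firstChild

-- Create <li>A</li> and insert it before the current first item.
local item = document:createElement("li")
item:appendChild(document:createTextNode("A"))
list:insertBefore(item, list.firstChild)

-- removeChild returns the node that was detached from the tree.
local removed = list:removeChild(list.lastChild)

print(list.textContent)
print(removed.textContent)
print(list:contains(item), list:contains(removed))
```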
ParentNode
The ParentNode interface is implemented by all nodes that can have children.
Properties:
children - An ElementList of child Element nodes.
childElementCount - An integer representing the number of child Element nodes.
firstElementChild - The node’s first child Element if there is one, otherwise nil.
lastElementChild - The node’s last child Element if there is one, otherwise nil.
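To show how these properties differ from Node.childNodes, here is a brief sketch over a made-up element that mixes text, comment and element children. Only the element children are counted by the ParentNode properties.

```lua
local gumbo = require "gumbo"

local html = '<div id="box">text<b>1</b><!-- c --><i>2</i></div>'
local document = assert(gumbo.parse(html))
local box = assert(document:getElementById("box"))

-- childNodes includes the Text and Comment nodes; children does not.
print(box.childNodes.length, box.childElementCount)
print(box.firstElementChild.localName, box.lastElementChild.localName)

for i, element in ipairs(box.children) do
    print(i, element.localName)
end
```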
ChildNode
The ChildNode interface is implemented by all nodes that can have a parent.
Methods:
remove() - Removes the node from its parent.
Attribute Objects
AttributeList
An AttributeList is a list containing zero or more Attribute objects. Every Element node has an associated AttributeList, which can be accessed via the Element.attributes property.
Note: AttributeList is equivalent to what the DOM specification calls NamedNodeMap.
Properties:
TODO
Attribute
The Attribute type represents a single attribute of an Element.
Note: Attribute is equivalent to what the DOM specification calls Attr.
Properties:
name - The name of the attribute.
value - The value of the attribute.
escapedValue - The value of the attribute, escaped according to the rules in the HTML fragment serialization algorithm. Ampersand (&) characters in value become &amp;, double quote (") characters become &quot; and non-breaking spaces (U+00A0) become &nbsp;. This property is an extension; not a part of any specification.
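The relationship between value and escapedValue might be sketched as follows. Since the AttributeList properties are not yet documented here, the example assumes that the list can be traversed with ipairs like the other list types; the markup is made up.

```lua
local gumbo = require "gumbo"

local html = '<a id="x" title="Q &amp; A">link</a>'
local document = assert(gumbo.parse(html))
local a = assert(document:getElementById("x"))

-- Index the attributes by name while printing them.
-- Assumption: AttributeList supports ipairs traversal.
local byName = {}
for i, attr in ipairs(a.attributes) do
    print(attr.name, attr.value, attr.escapedValue)
    byName[attr.name] = attr
end

-- value holds the decoded text; escapedValue is ready for serialization.
local title = byName.title
```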
Node Containers
NodeList
A list containing zero or more nodes, as typically returned by the Node.childNodes property.
Properties:
length - The number of nodes contained by the list.
ElementList
A list containing zero or more Element nodes, as typically returned by various Node and ParentNode methods.
Note: ElementList is equivalent to what the DOM specification calls HTMLCollection.
Properties:
length - The number of Element nodes contained by the list.
Examples
-- Prints a list of all hyperlinks in a document.
local gumbo = require "gumbo"
local document = assert(gumbo.parseFile(arg[1] or io.stdin))
for i, element in ipairs(document.links) do
    print(element:getAttribute("href"))
end

-- Prints the document title.
local gumbo = require "gumbo"
local document = assert(gumbo.parseFile(arg[1] or io.stdin))
print(document.title)

-- Removes the element with the given id from the document tree.
local gumbo = require "gumbo"
local id = assert(arg[1], "Error: arg[1] is nil; expected element id")
local document = assert(gumbo.parseFile(arg[2] or io.stdin))
local element = document:getElementById(id)
if element then
    element:remove()
end
document:serialize(io.stdout)

-- Replaces all "align" attributes on td/th elements with a "style"
-- attribute containing the equivalent CSS "text-align" property.
-- This can be used to fix Pandoc HTML output.
local gumbo = require "gumbo"
local Set = require "gumbo.Set"
local document = assert(gumbo.parseFile(arg[1] or io.stdin))
local alignments = Set{"left", "right", "center", "start", "end"}
local function fixAlignAttr(element)
    local align = element:getAttribute("align")
    if align and alignments[align] then
        local style = element:getAttribute("style")
        if style then
            element:setAttribute("style", style .. "; text-align:" .. align)
        else
            element:setAttribute("style", "text-align:" .. align)
        end
        element:removeAttribute("align")
    end
end
for node in document.body:walk() do
    if node.localName == "td" or node.localName == "th" then
        fixAlignAttr(node)
    end
end
document:serialize(io.stdout)

-- Prints the plain text contents of a HTML file, excluding the contents
-- of code elements. This may be useful for filtering out markup from a
-- HTML document before passing it to a spell-checker or other text
-- processing tool.
local gumbo = require "gumbo"
local document = assert(gumbo.parseFile(arg[1] or io.stdin))
local codeElements = assert(document:getElementsByTagName("code"))
for i, element in ipairs(codeElements) do
    element:remove()
end
io.write(document.body.textContent)