html

stable

Parse and manipulate HTML documents using CSS selectors, and escape or create HTML elements.

use plugin html::{parse_select, parse_select_text, parse_select_attr, …}

18 functions Web


Functions (18)

parse_select Select elements by CSS selector, return HTML
parse_select_text Select elements, return text content
parse_select_attr Select elements, return attribute values
extract_links Extract all anchor links with text and href
extract_images Extract all images with src and alt
text_content Get all text from entire document
select_nth Get text of nth CSS selector match
select_count Count elements matching a selector
outer_html Get full outer HTML of matching elements
extract_meta Extract all meta tag name/content pairs
extract_title Extract the page title string
extract_scripts Extract script tags with src and inline code
extract_styles Extract stylesheet links and inline CSS
strip_tags Strip all HTML tags, return plain text
extract_tables Extract HTML tables as nested arrays
escape Escape special characters as HTML entities
unescape Decode HTML entities back to characters
create_element Build an HTML element string

Overview

html is a stateless HTML toolkit built on a real CSS-selector engine, so you work with documents the same way a browser or scraper would: feed in an HTML string, query it with familiar selectors like "a", ".card", or "link[rel='stylesheet']", and get back plain Zolo strings, numbers, and tables. There are no handles or objects to manage: every function takes the document as a string argument and returns ordinary values, so each call is independent and re-parses the document it is given.

The functions fall into three groups: selector queries (parse_select, parse_select_text, parse_select_attr, select_nth, select_count, outer_html) that pull elements out of a document; high-level extractors (extract_links, extract_images, extract_meta, extract_title, extract_scripts, extract_styles, extract_tables, text_content, strip_tags) that return structured data for common page parts; and string builders (escape, unescape, create_element) for safely producing HTML. Reach for it whenever you need to scrape, inspect, or assemble HTML without an external parser.

Common patterns

Scrape a navigation menu by pulling every link with its text and target:

use plugin html::{extract_links, select_count}

let page = "<nav><a href='/home'>Home</a><a href='/blog'>Blog</a></nav>"
let links = extract_links(page)
print("found {select_count(page, "a")} links")
for link in links {
  print("{link["text"]} -> {link["href"]}")
}

Read a page's metadata in one pass — title plus every <meta> tag:

use plugin html::{extract_title, extract_meta}

let page = "<head><title>Cats</title><meta name='description' content='All about cats'></head>"
print("title: {extract_title(page)}")
for m in extract_meta(page) {
  print("{m["name"]} = {m["content"]}")
}

Build an element safely by escaping untrusted text before nesting it:

use plugin html::{escape, create_element}

let comment = escape("<b>hi</b> & bye")
let safe = create_element("p", #{"class": "comment"}, comment)
print(safe)

parse_select(html, selector) → table

Select elements by CSS selector, return HTML

Parses the HTML document and returns a table of outer HTML strings for each element matching the CSS selector. Keys are 1-indexed integers.

use plugin html::{parse_select}

let doc = "<ul><li>Alice</li><li>Bob</li></ul>"
let items = parse_select(doc, "li")
print(items[1])
print(items[2])

parse_select_text(html, selector) → table

Select elements, return text content

Like parse_select but returns the text content of each matched element instead of its HTML.

use plugin html::{parse_select_text}

let doc = "<div><p>Hello <b>World</b></p><p>Goodbye</p></div>"
let texts = parse_select_text(doc, "p")
print(texts[1])

parse_select_attr(html, selector, attr_name) → table

Select elements, return attribute values

Returns a table of attribute values for a given attribute name across all elements matching the selector. Elements without the attribute are skipped.

use plugin html::{parse_select_attr}

let doc = "<a href='/home'>Home</a><a href='/about'>About</a>"
let hrefs = parse_select_attr(doc, "a", "href")
print(hrefs[1])
print(hrefs[2])

Use any attribute name and a more specific selector to read, say, image sources inside a gallery:

use plugin html::{parse_select_attr}

let doc = "<div class='gallery'><img src='1.png'><img src='2.png'></div>"
let srcs = parse_select_attr(doc, ".gallery img", "src")
print(srcs[1])

extract_links(html) → table

Extract all anchor links with text and href

Returns a table of {text, href} tables for every <a href="..."> element in the document.

use plugin html::{extract_links}

let doc = "<a href='https://example.com'>Example</a>"
let links = extract_links(doc)
print(links[1]["text"])
print(links[1]["href"])
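
To make the shape of the result concrete, here is a minimal Python sketch of the same idea using only the standard library's html.parser. It is an analogue for illustration, not the plugin's implementation: the names LinkCollector and collect_links are hypothetical, and skipping anchors without an href is an assumption read from the description above. Note that Python lists are 0-indexed, whereas the Zolo tables returned by the plugin are 1-indexed.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects one {text, href} record per <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._current = None  # record for the <a> currently open, if any

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        # Assumption: anchors without an href are ignored.
        if tag == "a" and href is not None:
            self._current = {"text": "", "href": href}

    def handle_data(self, data):
        # Accumulate text only while inside a tracked anchor.
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self.links.append(self._current)
            self._current = None


def collect_links(document):
    """Hypothetical helper: parse the document and return the link records."""
    parser = LinkCollector()
    parser.feed(document)
    parser.close()
    return parser.links


links = collect_links("<a href='https://example.com'>Example</a><a>no target</a>")
print(links)
```

The second anchor has no href, so only one record comes back in this sketch.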

extract_images(html) → table

Extract all images with src and alt

Returns a table of {src, alt} tables for every <img> element in the document.

use plugin html::{extract_images}

let doc = "<img src='logo.png' alt='Logo'><img src='banner.jpg' alt=''>"
let imgs = extract_images(doc)
print(imgs[1]["src"])

text_content(html) → string

Get all text from entire document

Concatenates all text nodes in the document and returns a single string.

use plugin html::{text_content}

let doc = "<h1>Title</h1><p>Body text here.</p>"
let text = text_content(doc)
print(text)

select_nth(html, selector, n) → string

Get text of nth CSS selector match

Returns the text content of the nth match (1-indexed) of the CSS selector. Returns nil if there is no nth match.

use plugin html::{select_nth}

let doc = "<ul><li>First</li><li>Second</li><li>Third</li></ul>"
let second = select_nth(doc, "li", 2)
print(second)

select_count(html, selector) → number

Count elements matching a selector

Counts how many elements in the document match the given CSS selector.

use plugin html::{select_count}

let doc = "<p>One</p><p>Two</p><p>Three</p>"
let n = select_count(doc, "p")
print("Paragraph count: {n}")

Pair it with a class selector to check whether a page contains a given widget before doing more work:

use plugin html::{select_count}

let doc = "<div class='alert'>Warning</div><div class='alert'>Error</div>"
if select_count(doc, ".alert") > 0 {
  print("page has alerts")
}

outer_html(html, selector) → table

Get full outer HTML of matching elements

Returns a table of full outer HTML strings (including the element tag itself) for each element matching the selector.

use plugin html::{outer_html}

let doc = "<div class='card'><span>Hi</span></div>"
let results = outer_html(doc, ".card")
print(results[1])

extract_meta(html) → table

Extract all meta tag name/content pairs

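Returns a table of {name, content} tables, one for every <meta> tag. The name field holds the value of the tag's name or property attribute, so Open Graph tags such as og:title are included alongside ordinary named tags.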

use plugin html::{extract_meta}

let doc = "<meta name='description' content='Page about cats'><meta property='og:title' content='Cats'>"
let metas = extract_meta(doc)
print(metas[1]["name"])
print(metas[1]["content"])
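
The name-or-property rule can be sketched in Python with the standard library's html.parser. This is an illustrative analogue, not the plugin's code: MetaCollector and collect_meta are hypothetical names, and both the preference for name over property and the skipping of tags that lack either attribute or a content value are assumptions.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collects {name, content} records from <meta> tags."""

    def __init__(self):
        super().__init__()
        self.metas = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        # Assumption: prefer the name attribute, fall back to property.
        key = attr.get("name") or attr.get("property")
        # Assumption: tags without a key or a content attribute are skipped.
        if key is not None and "content" in attr:
            self.metas.append({"name": key, "content": attr["content"]})


def collect_meta(document):
    """Hypothetical helper: parse the document and return the meta records."""
    parser = MetaCollector()
    parser.feed(document)
    parser.close()
    return parser.metas


metas = collect_meta(
    "<meta name='description' content='Page about cats'>"
    "<meta property='og:title' content='Cats'>"
    "<meta charset='utf-8'>"
)
print(metas)
```

Here the charset tag is dropped because it has neither a name nor a property, while the Open Graph tag is reported under its property value.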

extract_title(html) → string

Extract the page title string

Returns the text content of the first <title> element, or an empty string if none exists.

use plugin html::{extract_title}

let doc = "<html><head><title>My Page</title></head><body></body></html>"
let title = extract_title(doc)
print(title)

extract_scripts(html) → table

Extract script tags with src and inline code

Returns a table of {src, inline_code} tables for every <script> tag. External scripts have src set; inline scripts have inline_code set.

use plugin html::{extract_scripts}

let doc = "<script src='/app.js'></script><script>console.log('hi')</script>"
let scripts = extract_scripts(doc)
print(scripts[1]["src"])
print(scripts[2]["inline_code"])

extract_styles(html) → table

Extract stylesheet links and inline CSS

Returns a table of {href, inline_css} tables for every <link rel="stylesheet"> and <style> element.

use plugin html::{extract_styles}

let doc = "<link rel='stylesheet' href='/style.css'><style>body { margin: 0 }</style>"
let styles = extract_styles(doc)
print(styles[1]["href"])
print(styles[2]["inline_css"])

strip_tags(html) → string

Strip all HTML tags, return plain text

Removes all HTML tags and returns normalized plain text with whitespace collapsed.

use plugin html::{strip_tags}

let doc = "<h1>Hello</h1>  <p>World  and  beyond.</p>"
let plain = strip_tags(doc)
print(plain)
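
One way to picture "tags removed, whitespace collapsed" is the following Python sketch built on the standard library's html.parser. It is only an approximation of the behaviour described above: TextCollector and strip_tags_sketch are hypothetical names, and the exact normalization rules of the plugin (for example, how adjacent elements with no whitespace between them are separated) are not documented here.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Keeps every text node and discards the tags around them."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_tags_sketch(document):
    """Hypothetical helper: drop tags, then collapse runs of whitespace."""
    parser = TextCollector()
    parser.feed(document)
    parser.close()
    text = "".join(parser.chunks)
    # str.split() with no argument splits on any whitespace run and drops
    # leading/trailing whitespace, so joining with one space collapses it.
    return " ".join(text.split())


plain = strip_tags_sketch("<h1>Hello</h1>  <p>World  and  beyond.</p>")
print(plain)
```

In this sketch, text from neighbouring elements runs together when the source has no whitespace between them; the plugin may behave differently.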

extract_tables(html) → table

Extract HTML tables as nested arrays

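Extracts every <table> element in the document. Returns a table with one entry per <table>; each entry is a table of rows, and each row is a table of cell text strings. Header cells (<th>) and data cells (<td>) are both returned as text, as the example below shows.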

use plugin html::{extract_tables}

let doc = "<table><tr><th>Name</th><th>Age</th></tr><tr><td>Alice</td><td>30</td></tr></table>"
let tables = extract_tables(doc)
let row1 = tables[1][1]
print(row1[1])
print(row1[2])
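
The three levels of nesting (tables, then rows, then cells) can be sketched in Python with the standard library's html.parser. This is an illustration of the data shape rather than the plugin's implementation: TableCollector and collect_tables are hypothetical names, nested tables are not handled, and trimming whitespace around cell text is an assumption. Python lists are 0-indexed, so tables[0][0] here corresponds to tables[1][1] in Zolo.

```python
from html.parser import HTMLParser


class TableCollector(HTMLParser):
    """Builds tables -> rows -> cell text from <table>, <tr>, <th>/<td>."""

    def __init__(self):
        super().__init__()
        self.tables = []
        self._cell = None  # text pieces of the cell currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])            # start a new table
        elif tag == "tr" and self.tables:
            self.tables[-1].append([])        # start a new row
        elif tag in ("td", "th") and self.tables and self.tables[-1]:
            self._cell = []                   # start a new cell

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Assumption: surrounding whitespace in a cell is trimmed.
            self.tables[-1][-1].append("".join(self._cell).strip())
            self._cell = None


def collect_tables(document):
    """Hypothetical helper: parse the document and return the nested lists."""
    parser = TableCollector()
    parser.feed(document)
    parser.close()
    return parser.tables


tables = collect_tables(
    "<table><tr><th>Name</th><th>Age</th></tr>"
    "<tr><td>Alice</td><td>30</td></tr></table>"
)
print(tables)
```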

escape(text) → string

Escape special characters as HTML entities

Escapes &, <, >, ", and ' into their HTML entity equivalents, so the result is safe to insert into HTML text or attribute values.

use plugin html::{escape}

let safe = escape("<script>alert('xss')</script>")
print(safe)

Escape user input before interpolating it into markup you build by hand:

use plugin html::{escape}

let name = "Tom & \"Jerry\""
print("<span>{escape(name)}</span>")
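
The escaping rule itself is small enough to sketch. The Python function below is a hypothetical stand-in (escape_sketch is not part of the plugin) that replaces the same five characters; the specific entity chosen for the apostrophe (&#39;) is an assumption, since the plugin's exact output is not shown above.

```python
def escape_sketch(text):
    """Replace the five HTML-significant characters with entities."""
    replacements = [
        # The ampersand must go first; otherwise the "&" introduced by the
        # later replacements would itself be escaped a second time.
        ("&", "&amp;"),
        ("<", "&lt;"),
        (">", "&gt;"),
        ('"', "&quot;"),
        ("'", "&#39;"),  # assumption: numeric entity for the apostrophe
    ]
    for char, entity in replacements:
        text = text.replace(char, entity)
    return text


print(escape_sketch("<script>alert('xss')</script>"))
print(escape_sketch('Tom & "Jerry"'))
```

Escaping is not idempotent: running the result through the function again turns &amp; into &amp;amp;, so text should be escaped exactly once, at the point where it enters the markup.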

unescape(text) → string

Decode HTML entities back to characters

Decodes common HTML entities (&amp;, &lt;, &gt;, &quot;, the apostrophe entity, &nbsp;, etc.) back to their original characters.

use plugin html::{unescape}

let raw = unescape("Tom &amp; Jerry &lt;3")
print(raw)
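
For comparison, Python's standard library offers html.unescape, which performs the same kind of decoding. The sketch below uses it as an analogue only; the plugin may support a different set of entities, so treat the details as assumptions rather than a specification.

```python
import html

# Named entities decode back to the characters they stand for.
decoded = html.unescape("Tom &amp; Jerry &lt;3")
print(decoded)

# &nbsp; decodes to a non-breaking space (U+00A0), not an ordinary space.
spaced = html.unescape("a&nbsp;b")
print(repr(spaced))

# Decoding is a single pass: text that was escaped twice needs two calls.
once = html.unescape("&amp;lt;")
twice = html.unescape(once)
print(once, twice)
```

The non-breaking space matters when comparing decoded text against strings typed with a regular space, because the two characters are not equal.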

create_element(tag, attrs, inner_html) → string

Build an HTML element string

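Builds an HTML element string. Pass a table of string key/value pairs for attrs and a string for inner_html; either can be nil when the element has no attributes or no content. Void tags (br, img, input, etc.) are self-closed, so no closing tag is emitted for them.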

use plugin html::{create_element}

let link = create_element("a", #{"href": "/home", "class": "nav"}, "Home")
print(link)

let br = create_element("br", nil, nil)
print(br)
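
The void-tag rule can be illustrated with a short Python sketch. Everything here is a hypothetical stand-in for the plugin: create_element_sketch is an invented name, the "<br />" spelling of a self-closed tag is an assumption, and so is the choice to escape attribute values (the description above does not say whether the plugin does). The set of void tags follows the HTML standard.

```python
import html

# Void elements as listed in the HTML standard; they never have content.
VOID_TAGS = {
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
}


def create_element_sketch(tag, attrs=None, inner_html=None):
    """Build an element string; attrs and inner_html may be None."""
    attr_text = "".join(
        # Assumption: attribute values are escaped before being quoted.
        f' {name}="{html.escape(str(value), quote=True)}"'
        for name, value in (attrs or {}).items()
    )
    if tag in VOID_TAGS:
        # Assumption: void tags are written in the "<br />" style.
        return f"<{tag}{attr_text} />"
    # inner_html is inserted as-is, so callers must escape untrusted text.
    return f"<{tag}{attr_text}>{inner_html or ''}</{tag}>"


print(create_element_sketch("a", {"href": "/home", "class": "nav"}, "Home"))
print(create_element_sketch("br"))
```

Because inner_html is inserted verbatim in this sketch, untrusted text should be escaped before it is passed in, as the safe-comment pattern earlier in this page does.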
