HTMLevator is a series of enhancement utilities for improving HTML documents. These tools can be used on HTML files made from .docx files with XSweet, or on other, arbitrary HTML files. HTMLevator features include:
HTMLevator includes a feature that attempts to infer which elements are headings, transforming them from <p>s into headings: <h1> through <h6>. This is more art than science, as the input is generally not semantically tagged and structured. It is sometimes trivial to infer headings, but it is also frequently quite difficult or impossible to do so unassisted or programmatically. As such, heading promotion will not catch all headings all the time, and it will work better on some documents than on others.
There are 3 heading promotion strategies built into XSweet: format-based (ranked-format), outline-level, and custom mapping.
The header-promote/header-promotion-CHOOSE.xsl sheet will try to pick the best approach to use for a given document: header-promotion-CHOOSE.xsl checks to see whether outline levels appear to have been used. If outline level data exists, it is used as the basis for heading promotion. Otherwise, format-based promotion is used.
Alternatively, you can specify the heading promotion method to use by passing it as a runtime parameter with header-promotion-CHOOSE.xsl:
method=ranked-format
method=outline-level
method=my-styles.xml
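As a rough illustration of passing the method as a runtime parameter, the sketch below builds a command line for an XSLT processor. It assumes Saxon-HE is the processor; the jar name saxon-he.jar, the helper name xslt_command, and the file names are hypothetical, and the exact invocation will depend on how you run your XSLT.

```python
def xslt_command(source, stylesheet, output, **params):
    """Build an argument list for running one XSL sheet over one file.

    Assumption: Saxon-HE is the XSLT processor, and "saxon-he.jar" is a
    placeholder for wherever that jar lives on your system.
    """
    cmd = ["java", "-jar", "saxon-he.jar",
           "-s:" + source, "-xsl:" + stylesheet, "-o:" + output]
    # Stylesheet parameters are passed to Saxon as trailing name=value pairs.
    cmd += ["{}={}".format(name, value) for name, value in params.items()]
    return cmd


if __name__ == "__main__":
    print(" ".join(xslt_command(
        "my-source.html",
        "header-promote/header-promotion-CHOOSE.xsl",
        "my-output.html",
        method="outline-level",
    )))
```

The resulting list can be handed to subprocess.run if you want to execute it from a script.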
As a rule, authors indicate headings with visual formatting far more commonly than by applying named MS Word styles. It’s not possible to have a discrete list of what kind of formatting indicates a heading, as it changes from file to file and is highly contextual. Instead, each individual document and its formatting must be analyzed as a whole before making guesses about headings. Format-based heading promotion does just this.
This approach works well for some documents and poorly for others. One size does not fit all, and the approach is simply to optimize for what works well with the greatest number of documents. Table of contents and reference files often contain many short paragraphs, leading to erroneous heading promotion.
The header-promote/digest-paragraphs.xsl sheet performs this file analysis. It makes a representation of every <p> in the document with relevant formatting properties:
font-size
font-style
font-weight
text-decoration
color
text-align
Next, it sorts paragraphs into groups that share identical formatting, one group for each distinct combination of properties. These groups are candidates for promotion from <p> to <h1-6>. HTMLevator considers:
Decisions about what to consider headings are made as follows:
<p>s with the same styling suggest the paragraphs aren’t headings), AND
After HTMLevator has identified paragraph groups to mark as headings, it must guess the outline level. It does so based on the following attributes, in this order:
Generally speaking, HTMLevator’s heading detection does a better job detecting headings than it does at guessing the heading’s level.
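The grouping step described above can be sketched in a few lines. This is only an illustration of the idea, not the logic of digest-paragraphs.xsl itself: it assumes each paragraph's formatting is available as an inline style string, and the function names are hypothetical.

```python
from collections import defaultdict

# The formatting properties listed above; anything else in a style is ignored.
PROPS = ("font-size", "font-style", "font-weight",
         "text-decoration", "color", "text-align")


def parse_style(style):
    """Turn 'a: b; c: d' into {'a': 'b', 'c': 'd'}."""
    pairs = (decl.split(":", 1) for decl in style.split(";") if ":" in decl)
    return {name.strip(): value.strip() for name, value in pairs}


def group_by_format(paragraphs):
    """Group paragraph texts by their combination of formatting properties.

    paragraphs: list of (style_string, text) pairs.
    Returns a dict mapping a tuple of property values to the list of texts.
    """
    groups = defaultdict(list)
    for style, text in paragraphs:
        props = parse_style(style)
        key = tuple(props.get(name, "") for name in PROPS)
        groups[key].append(text)
    return dict(groups)


if __name__ == "__main__":
    sample = [
        ("font-size: 18pt; font-weight: bold", "Introduction"),
        ("font-size: 12pt", "Some body text."),
        ("font-weight: bold; font-size: 18pt", "Methods"),
    ]
    for key, texts in group_by_format(sample).items():
        print(key, texts)
```

Each resulting group is then a candidate for promotion; deciding which groups are headings, and at what level, is a separate step.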
This is the default heading promotion method, run if outline level data is not present. You can also select it explicitly by passing method=ranked-format.
header-promote/digest-paragraphs.xsl makes the paragraph groupings, and guesses what formats should be headings (and what level those headings should be).
The header-promote/make-header-escalator-xslt.xsl sheet uses the digest-paragraphs.xsl output as its input, which it uses to produce a bespoke XSL sheet.
The bespoke XSL sheet is then applied to the original HTML file, replacing <p>s thought to be headings with <h1-6>.
An outline level can be specified on a paragraph in Word (which often comes from a named Word style). Some writers use this outlining functionality in Word, either deliberately or implicitly through careful use of named styles. In these instances, outline levels are often a reliable indicator of headings and heading levels.
When outline levels are specified in Word’s XML (e.g. <w:outlineLvl w:val="0"/>), they are extracted by XSweet as an -xsweet-outline-level property on the <p>.
When this property is present at least twice in the HTML document, the header-promote/header-promotion-CHOOSE.xsl sheet will elect to use outline levels to promote headings.
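The selection rule above is simple enough to sketch. The following is a minimal approximation, assuming the -xsweet-outline-level property appears as a declaration in inline style attributes; the function name choose_method is hypothetical and this is not the XSLT used by header-promotion-CHOOSE.xsl.

```python
import re

# Matches a "-xsweet-outline-level:" declaration anywhere in the markup.
OUTLINE_PROP = re.compile(r"-xsweet-outline-level\s*:")


def choose_method(html, requested=None):
    """Pick a heading promotion method for an HTML string.

    An explicitly requested method always wins. Otherwise outline levels
    are used when the property occurs at least twice, and format-based
    promotion is the fallback.
    """
    if requested:
        return requested
    hits = len(OUTLINE_PROP.findall(html))
    return "outline-level" if hits >= 2 else "ranked-format"


if __name__ == "__main__":
    doc = ('<p style="-xsweet-outline-level: 0">Title</p>'
           '<p style="-xsweet-outline-level: 1">Subtitle</p>')
    print(choose_method(doc))
```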
To create a custom configuration:
Create an XML mapping file (my-styles.xml or what have you). See the example provided in config-mockup.xml for syntax.
Run the header-promotion-CHOOSE.xsl sheet, passing the custom mapping .xml file as a runtime parameter (method=my-styles.xml).
make-header-mapper-xslt.xsl will generate and apply a custom XSL sheet based on your XML file.
hyperlink-inferencer/hyperlink-inferencer.xsl
This sheet searches for plain-text URLs and automatically links them. It can recognize links with the following TLDs:
XSweet looks for a top-level domain preceded by one or more strings that contain only letters, numbers, underscores and dashes (no spaces or other punctuation). These strings can be separated by periods ("."). Note that this rule will capture a www. if it is present.
XSweet will recognize the protocol and include it in the link, if one has been specified (http://, https://, ftp:). If the protocol has not been specified, http:// will be prepended to the link’s href.
This sheet will also capture query strings on links.
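The matching rules above can be approximated with a regular expression. This is a hedged sketch rather than the sheet's actual pattern: the three TLDs are placeholders standing in for the real list, the optional path segment is an assumption, and the function name linkify is hypothetical.

```python
import re

# Assumption: placeholder TLDs only; the sheet's real list is not reproduced here.
TLDS = ("com", "org", "net")

URL_RE = re.compile(
    r"(?P<protocol>https?://|ftp:(?://)?)?"   # optional protocol
    r"(?:[A-Za-z0-9_-]+\.)+"                  # one or more dot-separated strings
    r"(?:" + "|".join(TLDS) + r")\b"          # a recognized top-level domain
    r"(?:/[^\s?<>]*)?"                        # optional path (an assumption)
    r"(?:\?[^\s<>]*)?"                        # optional query string
)


def linkify(text):
    """Wrap plain-text URLs in <a> tags, adding http:// when no protocol is given."""
    def replace(match):
        url = match.group(0)
        href = url if match.group("protocol") else "http://" + url
        return '<a href="{}">{}</a>'.format(href, url)
    return URL_RE.sub(replace, text)


if __name__ == "__main__":
    print(linkify("See www.example.com?x=1 for details."))
```

The sketch works on plain text only; it makes no attempt to skip URLs that are already inside an <a> element.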
DETECT-ITEMIZE-LISTS.xsl
This module will recognize plain text that looks like a numbered list and mark up the corresponding list (as an <ol>) and list items (<li>s).
DETECT-ITEMIZE-LISTS.xsl runs 3 separate sheets in sequence:
detect-numbered-lists.xsl, which detects lists and bookends them with <xsw:list xmlns:xsw="http://coko.foundation/xsweet" level="0">
itemize-detected-lists.xsl, which converts the <xsw:list> tags to <ol>, and wraps each paragraph in <li>s
scrub-literal-numbering-lists.xsl, which removes from each list item the leading whitespace, literal text numbering, the period, and the whitespace after it
Lists must match the following pattern to be detected and marked as a numbered list:
List items that meet these criteria are scrubbed of their literal numbering (and following white space) in favor of automatically generated <ol> numbering.
Note that this feature creates a flat list (one level), rather than nested lists based on indentation.
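The three steps above (detect, itemize, scrub) can be illustrated with a small sketch that works on a list of paragraph strings. It assumes a list item looks like optional whitespace, digits, a period, then whitespace, which is inferred from the description of the scrubbing step; the real detection pattern may differ, and the function name itemize is hypothetical.

```python
import re

# Assumed pattern: leading whitespace, literal numbering, a period, whitespace.
NUMBERED = re.compile(r"^\s*\d+\.\s+")


def itemize(paragraphs):
    """Wrap runs of numbered paragraphs in a flat <ol>, scrubbing the literal numbers.

    paragraphs: list of paragraph text strings.
    Returns a list of output lines.
    """
    out, in_list = [], False
    for text in paragraphs:
        if NUMBERED.match(text):
            if not in_list:
                out.append("<ol>")
                in_list = True
            # Each paragraph is wrapped in an <li>, minus its literal numbering.
            out.append("<li><p>{}</p></li>".format(NUMBERED.sub("", text, count=1)))
        else:
            if in_list:
                out.append("</ol>")
                in_list = False
            out.append("<p>{}</p>".format(text))
    if in_list:
        out.append("</ol>")
    return out


if __name__ == "__main__":
    print("\n".join(itemize(["Steps:", "1. Mix", "2. Bake", "Done."])))
```

As in the module itself, the result is a flat, single-level list.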
This module can be run before or after the PROMOTE-lists.xsl feature in XSweet Core. To use it, you can modify the execute_chain.sh file of the XSweet_runner_scripts to include this step before the final-rinse.xsl step.
See also the documentation for marked list handling.
ucp-cleanup/ucp-text-macros.xsl
This sheet contains a suite of text cleanups, built specifically for use by the University of California Press. It automates many copyediting improvements:
Hyphens between numerals are converted to en dashes
Two or more consecutive spaces are converted to a single space
Any number of spaces before or after em dashes are removed
Series of periods are converted to ellipses
Two adjacent hyphens become an em dash
En dashes surrounded on both sides by spaces are converted to an em dash
Equal signs are normalized to be surrounded by one space on either side
Spaces adjacent to tabs are removed
Spaces at the beginning and end of paragraphs are removed
Tabs at the end of paragraphs are removed
Empty paragraphs are removed
Single and double quotation marks (including backticks) are converted to directional quotation marks
Hair spaces are inserted between single and double quotation marks
Punctuation marks are coerced to match the formatting of the previous word; e.g. <i>extraordinary</i>! becomes <i>extraordinary!</i>. This rule applies to the following punctuation marks:
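Several of the simpler cleanups in the list above (dashes, ellipses, spacing, equal signs) can be approximated with regular expressions. The sketch below is an illustration only: the order of the rules, the threshold of three periods for an ellipsis, and the function name clean_text are assumptions, not details taken from ucp-text-macros.xsl.

```python
import re

EN_DASH, EM_DASH, ELLIPSIS = "\u2013", "\u2014", "\u2026"


def clean_text(text):
    """Apply a handful of copyediting cleanups to one paragraph of plain text."""
    text = re.sub(r"(?<=\d)-(?=\d)", EN_DASH, text)        # hyphen between numerals
    text = text.replace("--", EM_DASH)                     # two adjacent hyphens
    text = text.replace(" " + EN_DASH + " ", EM_DASH)      # spaced en dash
    text = re.sub(" *" + EM_DASH + " *", EM_DASH, text)    # no spaces around em dashes
    text = re.sub(r"\.{3,}", ELLIPSIS, text)               # series of periods (assumed 3+)
    text = re.sub(r" {2,}", " ", text)                     # consecutive spaces
    text = re.sub(r" *= *", " = ", text)                   # one space around "="
    return text.strip(" ")                                 # trim paragraph edges


if __name__ == "__main__":
    print(clean_text("pages 10-12 --  see  a=b..."))
```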
ucp-cleanup/ucp-mappings.xsl
In this step, underlining and bolding are converted to italics, either as inline tags or style CSS:
<b>s and <u>s are replaced with <i>s
style="font-weight: bold" and style="text-decoration: underline" become style="font-style: italic"
Short and sweet.
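For illustration, the two mappings can be mimicked with plain string substitution. This is a rough sketch under simplifying assumptions (attribute-less <b> and <u> tags, declarations written as shown above); the function name to_italics is hypothetical, and the real sheet operates on the parsed document rather than on raw text.

```python
import re


def to_italics(html):
    """Map bold and underline markup to italics in an HTML string."""
    # <b>, </b>, <u> and </u> become <i> or </i>; tags with attributes are left alone.
    html = re.sub(r"<(/?)(?:b|u)>", r"<\1i>", html)
    # The equivalent CSS declarations are rewritten in place.
    html = re.sub(r"font-weight:\s*bold", "font-style: italic", html)
    html = re.sub(r"text-decoration:\s*underline", "font-style: italic", html)
    return html


if __name__ == "__main__":
    print(to_italics('<p><b>Bold</b> and <span style="text-decoration: underline">under</span></p>'))
```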
The files in the html-tweak folder can be used to extend XSweet, by defining custom transformations to apply to the text. This can be done on a per-document basis, or to implement generic rules according to your use case.
Use is as follows:
Define your transformations in an .xml file.
Run the APPLY-html-tweaks.xsl sheet, referencing the above transformations defined in your .xml file. This:
(A) reads the user-defined transformations from your .xml file
(B) creates a new XSL sheet based on the .xml file that will implement the specified transformation (done with the make-html-tweak-xslt.xsl sheet)
(C) applies the created XSL sheet to the input file
Example use (exact script will depend upon how you are running your XSLT):
XSLT my-source.html APPLY-html-tweaks.xsl config=my-html-tweaks.xml
The user-specified tweaks work by establishing matches between categories of HTML elements (most commonly but certainly not limited to <p>s or <span>s), as indicated by:
styling (the style attribute), or
class (the class attribute)
The syntax to define HTML tweaks uses the following components:
where: a wrapper for a rule
match: conditions on an element for it to match
style: a style property name or property-name: value combination
class: a class value (name token)
Remove Default classes from HTML elements where they appear:
<p class="Default">Here is default class paragraph</p>
becomes:
<p>Here is default class paragraph</p>
HTML tweak rule:
<where>
<match><class>Default</class></match>
<remove><class>Default</class></remove>
</where>
Remove a specific styling property wherever it’s present:
<p style="text-indent:1em; margin-bottom: 1em">Styling includes a property</p>
becomes:
<p style="text-indent:1em">Styling includes a property</p>
HTML tweak rule:
<where>
<match><style>margin-bottom</style></match>
<remove><style>margin-bottom</style></remove>
</where>
Remove a style property if it has a given value:
<p style="font-family: Helvetica; font-size: 12pt">Remove a property if it has a specific value</p>
becomes:
<p style="font-size: 12pt">Remove a property if it has a specific value</p>
HTML tweak rule:
<where>
<match><style>font-family: Helvetica</style></match>
<remove><style>font-family</style></remove>
</where>
The following tweak rule will map a specific class and style to another class and style:
<where>
<match>
<style>font-size: 18pt</style>
<class>FreeForm</class>
</match>
<remove>
<style>font-size</style>
<class>FreeForm</class>
</remove>
<add>
<class>FreeFormNew</class>
<style>color: red</style>
</add>
</where>
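To make the semantics of a rule concrete, here is a small sketch that interprets a single <where> rule against one element's class list and style string. It assumes that every condition inside <match> must hold for the rule to fire; the function names are hypothetical, and this is not how make-html-tweak-xslt.xsl is implemented (that sheet generates XSLT instead).

```python
import xml.etree.ElementTree as ET

RULE = """
<where>
  <match><style>font-size: 18pt</style><class>FreeForm</class></match>
  <remove><style>font-size</style><class>FreeForm</class></remove>
  <add><class>FreeFormNew</class><style>color: red</style></add>
</where>
"""


def parse_style(style):
    """Turn 'a: b; c: d' into {'a': 'b', 'c': 'd'}."""
    pairs = (decl.split(":", 1) for decl in style.split(";") if ":" in decl)
    return {name.strip(): value.strip() for name, value in pairs}


def style_matches(styles, spec):
    """spec is either a property name or a 'name: value' combination."""
    name, _, value = (part.strip() for part in spec.partition(":"))
    return name in styles and (not value or styles[name] == value)


def apply_rule(rule_xml, classes, style):
    """Return the (classes, style) pair after applying one tweak rule."""
    rule = ET.fromstring(rule_xml)
    styles = parse_style(style)
    match = rule.find("match")
    # Assumption: all class and style conditions must be satisfied together.
    matched = (all(c.text.strip() in classes for c in match.findall("class"))
               and all(style_matches(styles, s.text) for s in match.findall("style")))
    if not matched:
        return classes, style
    classes = [c for c in classes
               if c not in {r.text.strip() for r in rule.findall("remove/class")}]
    for s in rule.findall("remove/style"):
        styles.pop(s.text.partition(":")[0].strip(), None)
    classes += [c.text.strip() for c in rule.findall("add/class")]
    for s in rule.findall("add/style"):
        styles.update(parse_style(s.text))
    return classes, "; ".join("{}: {}".format(k, v) for k, v in styles.items())


if __name__ == "__main__":
    print(apply_rule(RULE, ["FreeForm"], "font-size: 18pt; text-indent: 1em"))
```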
For further examples, see the demo files included in the repository:
html-tweak-map.xml contains example transformation definitions
html-tweak-demo.xsl is the resulting XSL sheet made by make-html-tweak-xslt.xsl, which will effect the specified transformation. (This relies on the html-tweak-lib.xsl file as a dependency.)
This utility uses headings (<h1-6>) as markers and attempts to add <section>s to an HTML file. It is run as a single XSL sheet, induce-sections/induce-sections.xsl, which returns the document HTML file unchanged except for the addition of <section> tags.
<section>s
The HTML content must be wrapped in <div class="docx-body"> for this sheet to work. It will be wrapped if it has been extracted by XSweet; otherwise you will have to add this element yourself.
Example:
<!-- Headers out of regular order: h1, h2, h3, h1, h3-->
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta charset="utf-8"></meta>
    <title>sections</title>
  </head>
  <body>
    <div class="docx-body">
      <h1>h1</h1>
      <p>h1 para</p>
      <h2>h2</h2>
      <p>h2 para</p>
      <h3>h3</h3>
      <p>h3 para</p>
      <p>h3 para</p>
      <h1>h1</h1>
      <p>h1 para</p>
      <h3>h3</h3>
      <p>h3 para</p>
    </div>
  </body>
</html>
becomes
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta charset="utf-8" />
    <title>sections</title>
  </head>
  <body>
    <div class="docx-body">
      <!-- Headers out of regular order: h1, h2, h3, h1, h3-->
      <section>
        <h1>h1</h1>
        <p>h1 para</p>
        <section>
          <h2>h2</h2>
          <p>h2 para</p>
          <section>
            <h3>h3</h3>
            <p>h3 para</p>
            <p>h3 para</p>
          </section>
        </section>
      </section>
      <section>
        <h1>h1</h1>
        <p>h1 para</p>
        <section>
          <h3>h3</h3>
          <p>h3 para</p>
        </section>
      </section>
    </div>
  </body>
</html>
mark-sections.xsl and nest-sections.xsl are deprecated; the induce-sections.xsl sheet encapsulates the functionality from both.
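The nesting behavior shown in the example above can be reproduced with a simple stack of open heading levels. This is a sketch of the general technique, not a port of induce-sections.xsl; it works on a flat list of (tag, text) pairs rather than on a parsed document, and the function name induce_sections is hypothetical.

```python
import re


def induce_sections(elements):
    """Add <section> tags around headings and the content that follows them.

    elements: list of (tag, text) pairs such as ("h1", "Title") or ("p", "text").
    Returns a list of output lines.
    """
    out, open_levels = [], []   # open_levels holds the heading level of each open section
    for tag, text in elements:
        heading = re.fullmatch(r"h([1-6])", tag)
        if heading:
            level = int(heading.group(1))
            # Close every open section at the same or a deeper level.
            while open_levels and open_levels[-1] >= level:
                open_levels.pop()
                out.append("</section>")
            open_levels.append(level)
            out.append("<section>")
        out.append("<{0}>{1}</{0}>".format(tag, text))
    # Close whatever is still open at the end of the document.
    out.extend("</section>" for _ in open_levels)
    return out


if __name__ == "__main__":
    doc = [("h1", "h1"), ("p", "h1 para"), ("h3", "h3"), ("p", "h3 para")]
    print("\n".join(induce_sections(doc)))
```

Because a heading only closes sections at its own level or deeper, a skipped level (an h3 directly under an h1) still nests one level down, as in the example.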