A quick guide to the boxtracking API in Prince

Think inside the boxes

Prince is an HTML-to-PDF-via-CSS converter that offers advanced layout capabilities. To achieve perfection, it is sometimes necessary to analyze the output of the formatting process. You can do so through a JavaScript API, and this guide will show you how. Some knowledge of HTML, CSS and JavaScript is required to follow this tutorial. For a more formal description of the boxtracking API, see the Prince documentation.

Here's a minimal, but complete, document which says hello to the boxtracking API:

<!DOCTYPE html><html><meta charset=utf-8>
<script>
Prince.trackBoxes = true;
Prince.registerPostLayoutFunc(countpages);

function countpages() {
  var pages = PDF.pages;
  console.log("Number of pages: "+pages.length);
}
</script>
<body>
<p>Hello world!
</html>

To format the document yourself, download and install Prince, then run this in a console:

prince -j https://css4.pub/2023/boxtracking/sample-1.html -o foo.pdf

When Prince runs, you will see this in your console:

Number of pages: 1

You can process all examples in this guide in this manner; the source code is linked on the right side of the box.

Looking inside

In the first example we just counted the number of elements in the PDF.pages array; this number corresponds to the number of pages in the formatted document. If we look inside the PDF.pages array, we can find all boxes created by Prince in the formatting process:

Prince.trackBoxes = true;
Prince.registerPostLayoutFunc(analyze);

function lookwithin(str,box) {
  console.log(str+box.type); 

  for (var i=0; i<box.children.length; i++) {
    lookwithin(str+"  ",box.children[i]);
  }
}

function analyze() {
  var pages = PDF.pages;

  for (var i = 0; i<pages.length; ++i) {
     console.log("Boxes on page "+(i+1)+":");
     lookwithin("  ", pages[i]);
  }
}

When processed, Prince logs:

Boxes on page 1:
  BODY
    BOX
      BOX
        BOX
          LINE
            TEXT
          LINE
            TEXT

Prince generates different kinds of boxes: BODY, BOX, LINE, TEXT. Boxes are organized in a tree structure, similar to how HTML elements are organized. The two structures will often be similar, but one element can have many boxes due to line breaks and page breaks.

The boxtracking API allows you to analyze this tree of boxes. In the example above, the analyze function is started after the formatting process has been completed.
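
Since every box exposes `type` and `children`, a short recursive walk is enough to summarize the tree. The sketch below tallies boxes per type on each page. The names `countTypes` and `report` are made up for this illustration and are not part of Prince; the `typeof` guard is only there so that the file can also be loaded outside Prince.

```javascript
// Tally boxes by type, recursing through box.children.
function countTypes(box, counts) {
  counts[box.type] = (counts[box.type] || 0) + 1;
  for (var i = 0; i < box.children.length; i++) {
    countTypes(box.children[i], counts);
  }
  return counts;
}

// Post-layout function: log one line per box type and page.
function report() {
  var pages = PDF.pages;
  for (var i = 0; i < pages.length; i++) {
    var counts = countTypes(pages[i], {});
    for (var type in counts) {
      console.log("Page " + (i + 1) + ": " + counts[type] + " " + type);
    }
  }
}

// Only register when running inside Prince.
if (typeof Prince !== "undefined") {
  Prince.trackBoxes = true;
  Prince.registerPostLayoutFunc(report);
}
```

Keeping the walk in a function that takes the box as an argument makes it easy to reuse on a subtree, for example on the boxes of a single element.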

Finding tag names and text

By digging a little deeper, we can also find the elements that generated the boxes, and their content:

function lookwithin(str,box) {
  if (box.type=="BOX") { 
     console.log(str+"BOX created by "+box.element.tagName+" element"); 
  } else if (box.type=="TEXT") { 
     console.log(str+"TEXT with content: "+box.text); 
  } else {
     console.log(str+box.type); 
  }

  for (var i=0; i<box.children.length; i++) {
    lookwithin(str+"  ",box.children[i]);
  }
}

function analyze() {
  var pages = PDF.pages;

  for (var i = 0; i<pages.length; ++i) {
     console.log("Boxes on page "+(i+1)+":");
     lookwithin("  ", pages[i]);
  }
}

When processed, Prince logs:

Boxes on page 1:
  BODY
    BOX created by HTML element
      BOX created by BODY element
        BOX created by P element
          LINE
            TEXT with content: Hello
          LINE
            TEXT with content: world!

As you can see, it is quite simple to extract information from the tree of boxes.

End of the line

Consecutive lines ending with the same word may be sub-optimal. Here's a simple script that detects such lines:

  if (box.type == "TEXT") {
     var n = box.text.split(" ");
     var lastWord = n[n.length - 1];

     if (lastWord == previousLastWord) {
        console.log("LINE ENDING ALERT ON PAGE "+box.pageNum+": "+lastWord); 
     }
     previousLastWord = lastWord;
  }

When processed, Prince logs:

LINE ENDING ALERT ON PAGE 1: world.

Note that this example does not work when the text is justified, because the boxtracking API does not report word breaks in justified text.
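
The fragment above is only the core test. For reference, here is one way it could be embedded in a complete tree walk. This is a sketch, not the script used in the sample: the function name `checkEndings` and the `state` object are invented here, and trailing white space is trimmed defensively, which the fragment above does not do.

```javascript
// Walk the box tree and log an alert whenever two consecutive TEXT boxes
// end with the same word. `state.previous` carries the last word seen so far.
// Returns the number of alerts.
function checkEndings(box, state) {
  var alerts = 0;
  if (box.type == "TEXT") {
    var words = box.text.replace(/\s+$/, "").split(" ");
    var lastWord = words[words.length - 1];
    if (lastWord != "" && lastWord == state.previous) {
      console.log("LINE ENDING ALERT ON PAGE " + box.pageNum + ": " + lastWord);
      alerts++;
    }
    state.previous = lastWord;
  }
  for (var i = 0; i < box.children.length; i++) {
    alerts += checkEndings(box.children[i], state);
  }
  return alerts;
}

// Post-layout function: check all pages with one shared state.
function analyze() {
  var state = { previous: null };
  var pages = PDF.pages;
  for (var i = 0; i < pages.length; i++) {
    checkEndings(pages[i], state);
  }
}

// Only register when running inside Prince.
if (typeof Prince !== "undefined") {
  Prince.trackBoxes = true;
  Prince.registerPostLayoutFunc(analyze);
}
```

Like the fragment, this compares TEXT boxes rather than LINE boxes, so it assumes that each line holds a single TEXT box.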

Where are my elements?

A common task for the boxtracking API is to track boxes generated by certain elements. In this example, a CSS selector (p.foo, p.bar) indicates which elements to track, and the Prince-specific getPrinceBoxes function finds the corresponding boxes.

<p class=foo>Hello world!
<p>Hello world!
<p class=bar>Lorem ipsum...

The script extracts the x/y/width/height for all boxes belonging to matching elements:

tagname:  P  classname:  foo
  x:  15.874015748031498
  y:  210.8976377952756
  width:  195.02362204724412
  height:  27.77952755905511
  page:  1
tagname:  P  classname:  bar
  x:  15.874015748031498
  y:  155.3385826771654
  width:  195.02362204724412
  height:  138.89763779527556
  page:  1
  x:  15.874015748031498
  y:  210.8976377952756
  width:  195.02362204724412
  height:  194.45669291338578
  page:  2
  x:  15.874015748031498
  y:  210.8976377952756
  width:  195.02362204724412
  height:  83.33858267716533
  page:  3
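
The script that produced this listing is not reproduced above. A minimal sketch of how such a script could look is shown below; it relies only on getPrinceBoxes and on the box properties x, y, w, h and pageNum. The function names `reportBoxes` and `track` are invented for this illustration.

```javascript
// Log position and size of every box generated by the given elements.
// Returns the number of boxes reported.
function reportBoxes(elements) {
  var count = 0;
  for (var i = 0; i < elements.length; i++) {
    var el = elements[i];
    console.log("tagname: ", el.tagName, " classname: ", el.className);
    var boxes = el.getPrinceBoxes();
    for (var j = 0; j < boxes.length; j++) {
      console.log("  x: ", boxes[j].x);
      console.log("  y: ", boxes[j].y);
      console.log("  width: ", boxes[j].w);
      console.log("  height: ", boxes[j].h);
      console.log("  page: ", boxes[j].pageNum);
      count++;
    }
  }
  return count;
}

// Post-layout function: the selector decides which elements are tracked.
function track() {
  reportBoxes(document.querySelectorAll("p.foo, p.bar"));
}

// Only register when running inside Prince.
if (typeof Prince !== "undefined") {
  Prince.trackBoxes = true;
  Prince.registerPostLayoutFunc(track);
}
```

An element that is split across pages returns one box per fragment, which is why the second paragraph in the listing has three sets of coordinates.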

Color me red (but only on page 2, please)

In the previous examples, we have analyzed boxes without making any changes. In this example we will change the styling of certain elements.

The script below starts by looking for all p elements in the document. It then uses the Prince-specific getPrinceBoxes function to find the boxes that belong to the elements. If a box appears on page 2, the color of the corresponding p element will be set to red:

function setColor() {
  var elements = document.querySelectorAll("p");
  for(var i=0; i<elements.length; i++) { 
      var boxes = elements[i].getPrinceBoxes();
      for(var j=0; j<boxes.length; j++) { 
         if (boxes[j].pageNum == 2) {
            boxes[j].element.style.color="red";
         }
      }
   }
}

When the script makes changes to the element structure (in this case by setting the color to red), Prince will automatically rerun the formatting process, and we therefore see red text on the second page.

One common question about the boxtracking API is: how can I change the color (or other properties) of a box? This is not possible.

Elements vs boxes

This example uses the same script as above; it looks for all p elements and the corresponding boxes. However, in this case there is only one p element which is split into several boxes, each on a different page. Therefore, when the color of the element is changed, all boxes generated by the P element will be red.

function setColor() {
  var elements = document.querySelectorAll("p");
  for(var i=0; i<elements.length; i++) { 
      var boxes = elements[i].getPrinceBoxes();
      for(var j=0; j<boxes.length; j++) { 
         if (boxes[j].pageNum == 2) {
            boxes[j].element.style.color="red";
         }
      }
  }
}
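
To see the difference between elements and boxes for yourself, it can help to list the pages on which each element has boxes. The following is a small sketch along those lines; `pagesOf` and `listPages` are names chosen for this illustration only.

```javascript
// Return the distinct page numbers on which an element has boxes.
function pagesOf(element) {
  var boxes = element.getPrinceBoxes();
  var pages = [];
  for (var i = 0; i < boxes.length; i++) {
    if (pages.indexOf(boxes[i].pageNum) == -1) {
      pages.push(boxes[i].pageNum);
    }
  }
  return pages;
}

// Post-layout function: one log line per p element.
function listPages() {
  var elements = document.querySelectorAll("p");
  for (var i = 0; i < elements.length; i++) {
    console.log("P element " + (i + 1) + " has boxes on page(s): " +
                pagesOf(elements[i]).join(", "));
  }
}

// Only register when running inside Prince.
if (typeof Prince !== "undefined") {
  Prince.trackBoxes = true;
  Prince.registerPostLayoutFunc(listPages);
}
```

An element that reports more than one page cannot be styled per page through its style property, since the style applies to the element as a whole.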

Hanging punctuation

Prince does not support the hanging-punctuation property, but you can use the boxtracking API to detect some punctuation and make adjustments. In this example, opening quote symbols are marked in span elements. A simple script finds the width of the opening quotes and sets a corresponding negative text-indent value.

<p><span class=q>«</span>Hello world» is a commonly used phrase in computer science...
<p><span class=q>'</span>Hello world' is a commonly used phrase in computer science...
<p><span class=q>««««</span>Hello world»»»»» is a commonly used phrase in computer science...

Note that you can put any content into the span element, not just quote marks.
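
The script is not shown above, so here is a minimal sketch of the idea. It assumes that the marked span is a direct child of its paragraph, that the span generates a box of its own, and that box widths are reported in points. The function names `hang` and `hangQuotes` are invented for this illustration.

```javascript
// For each marked span, pull the first line of its parent to the left
// by the width of the span's first box.
function hang(spans) {
  for (var i = 0; i < spans.length; i++) {
    var boxes = spans[i].getPrinceBoxes();
    if (boxes.length > 0) {
      // Assumption: box widths are in points.
      spans[i].parentNode.style.textIndent = "-" + boxes[0].w + "pt";
    }
  }
}

// Post-layout function: span.q marks the content that should hang.
function hangQuotes() {
  hang(document.querySelectorAll("span.q"));
}

// Only register when running inside Prince.
if (typeof Prince !== "undefined") {
  Prince.trackBoxes = true;
  Prince.registerPostLayoutFunc(hangQuotes);
}
```

Changing text-indent does not change the width of the span itself, so a single pass should be enough and the function does not need to register itself again.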

Hanging punctuation using pseudo-elements

Here is a variation of the example above where we use pseudo-elements to generate quote marks:

blockquote { font-size: 18pt }
blockquote::before { content: "«"; color: red }
blockquote::after { content: "»"; color: red }

Dynamic table headers

HTML has a thead element for setting the header of a table. This works well when the header is the same on all pages. However, it is quite common to have a changing header, based on the current section of the table. This is similar to having running headers for chapters in a book.

This somewhat complex example creates dynamic table headers by using the boxtracking API. The strategy is to hide headers, rather than adding them. That is, the original document has duplicate headers on every other row. The script in the document below analyzes the page layout and hides duplicate headers not shown at the top of a page. The elements must be hidden one at a time, then the script must be rerun. Thankfully, Prince automatically handles such reruns.

<table>
<tr><th colspan=2>Apples
<tr><td>foo<td>bar
<tr class=duplicate><th colspan=2>Apples (cont.)
<tr><td>foo<td>bar
<tr class=duplicate><th colspan=2>Apples (cont.)
<tr><td>foo<td>bar
<tr class=duplicate><th colspan=2>Apples (cont.)
<tr><td>foo<td>bar
<tr><th colspan=2>Pears
<tr><td>foo<td>bar
<tr class=duplicate><th colspan=2>Pears (cont.)
<tr><td>foo<td>bar
<tr class=duplicate><th colspan=2>Pears (cont.)
<tr><td>foo<td>bar
...
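
The script for this example is not listed here. The sketch below shows one possible way to implement the strategy. It assumes that a hidden row generates no boxes, and it treats a duplicate header as being at the top of a page when the previous visible row ends on a different page. The function names `hideOneDuplicate` and `fixHeaders` are invented for this illustration.

```javascript
// Hide the first duplicate header row that is not the first row on its page.
// Returns true if a row was hidden, meaning the layout must be checked again.
function hideOneDuplicate(rows) {
  var lastPage = null;                       // page of the previous visible row
  for (var i = 0; i < rows.length; i++) {
    var boxes = rows[i].getPrinceBoxes();
    if (boxes.length == 0) continue;         // assumption: hidden rows have no boxes
    var isDuplicate = (" " + rows[i].className + " ").indexOf(" duplicate ") != -1;
    if (isDuplicate && boxes[0].pageNum == lastPage) {
      rows[i].style.display = "none";
      return true;
    }
    lastPage = boxes[boxes.length - 1].pageNum;
  }
  return false;
}

// Post-layout function: hide one row, then ask to be run again.
function fixHeaders() {
  if (hideOneDuplicate(document.querySelectorAll("tr"))) {
    Prince.registerPostLayoutFunc(fixHeaders);
  }
}

// Only register when running inside Prince.
if (typeof Prince !== "undefined") {
  Prince.trackBoxes = true;
  Prince.registerPostLayoutFunc(fixHeaders);
}
```

Hiding only one row per pass matters because every hidden row moves the rows after it, so page numbers collected before the change can no longer be trusted.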

Dump all data

This script dumps everything the Boxtracking API knows about a certain box. It starts with a CSS selector ("p") and finds the only paragraph in the document, which only has one box.

Prince.trackBoxes = true;
Prince.registerPostLayoutFunc(printBoxes);

function printObject(o) {
  var out = "";
  for (var p in o) {
    out += p + ': ' + o[p] + '\n';
  }
  console.log(out);
}

function printBoxes() {
  var elements = document.querySelectorAll("p");
  for(var i=0; i<elements.length; i++) { 
      var boxes = elements[i].getPrinceBoxes();
      for(var j=0; j<boxes.length; j++) { 
         printObject(boxes[j]);
      }
  }
}

The output is:

type: BOX
pageNum: 1
x: 15.874015748031498
y: 210.8976377952756
w: 195.02362204724412
h: 27.77952755905511
baseline: null
marginTop: 0
marginRight: 0
marginBottom: 0
marginLeft: 0
borderTop: 0
borderRight: 0
borderBottom: 0
borderLeft: 0
paddingTop: 0
paddingRight: 0
paddingBottom: 0
paddingLeft: 0
floatPosition: null
children: [object BoxInfoChildren]
parent: [object BoxInfo]
element: [object HTMLParagraphElement]
pseudo: null
text: null
src: null
style: [object CSSStyleDeclaration]

Avoiding runts

In typography, a runt is a short line at the end of a paragraph. Often, it will only be a single word which didn't fit into the line above. In this text, there are two – perhaps three? – runts:

<p>The total number of pages in the document was two. 
<p>The total number of pages in the document was three. 
<p>The total number of pages in the document was sixteen. 

Typographers often try to avoid such runts, e.g. by reducing the spacing between words and letters in the paragraph.

Runts are a result of the formatting process and they can therefore only be detected after the initial formatting. Then we can use the boxtracking API to look for runts, and change CSS properties to try to avoid them.

In the script below, we define a runt to be a paragraph where the last line is less than 25% of the maximum line length. When a runt is found, the letter-spacing of the corresponding paragraph is reduced until the runt disappears. This is an iterative process where Prince reformats the document after each change.

To show where changes have been made, paragraphs with reduced letter-spacing are also given a pink background. Note how the runt has disappeared in the first two paragraphs:

Prince.trackBoxes = true;
Prince.registerPostLayoutFunc(letterspacing);

function letterspacing(selector) {
   var elements = document.querySelectorAll("p");

   for(var i=0; i<elements.length; i++) {                    // loop through elements
      var thiselement = elements[i];
      var fboxes = elements[i].getPrinceBoxes();
      for (var j=0; j<fboxes.length; j++) {                  // loop through boxes
         var thisbox = fboxes[j];
         for (var k=0; k<thisbox.children.length; k++) {     // loop through lines
            var filling = (thisbox.children[k].w * 100) / thisbox.w;   
            if ((filling < 25) && (k == (thisbox.children.length - 1))) {       // if this is a runt
               var style = getComputedStyle(thiselement);
               var lsn = parseFloat(style.getPropertyValue('letter-spacing'));  // find current letter-spacing
               if (lsn) {
                  lsn -= 0.1;
               } else {
                  lsn = -0.1;
               }
               thiselement.style.letterSpacing = lsn+"pt";
               thiselement.style.background = "pink";
               Prince.registerPostLayoutFunc(letterspacing);          // rerun formatting
            }
         }
      }
   }
}

As for the third paragraph, its last line (sixteen) is too long to be considered a runt, and it is therefore left unchanged.

The script is quite simplistic; it only reduces the spacing between letters.
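
One possible refinement, sketched below, is to factor the runt test into a helper and to stop tightening once a lower limit is reached, so that a stubborn runt cannot squeeze a paragraph indefinitely. The helper names `isRunt`, `nextSpacing` and `tightenRunts` and the limit of -0.5pt are assumptions made for this illustration. As in the script above, the computed letter-spacing is assumed to be readable as a number of points. Unlike the script above, only the last box of each paragraph is tested, since that is where the last line ends up.

```javascript
// True if the last line of a box fills less than `percent` of the box width.
function isRunt(box, percent) {
  var n = box.children.length;
  if (n == 0 || box.w == 0) return false;
  var filling = (box.children[n - 1].w * 100) / box.w;
  return filling < percent;
}

// Next letter-spacing value in pt, or null when it would pass the limit.
// `current` may be NaN when the computed value is "normal".
function nextSpacing(current, limit) {
  var next = (current || 0) - 0.1;
  return next < limit ? null : next;
}

// Post-layout function: tighten paragraphs that end in a runt.
function tightenRunts() {
  var elements = document.querySelectorAll("p");
  var changed = false;
  for (var i = 0; i < elements.length; i++) {
    var boxes = elements[i].getPrinceBoxes();
    if (boxes.length == 0 || !isRunt(boxes[boxes.length - 1], 25)) continue;
    var style = getComputedStyle(elements[i]);
    var current = parseFloat(style.getPropertyValue("letter-spacing"));
    var next = nextSpacing(current, -0.5);   // hypothetical limit
    if (next !== null) {
      elements[i].style.letterSpacing = next + "pt";
      changed = true;
    }
  }
  if (changed) Prince.registerPostLayoutFunc(tightenRunts);   // rerun formatting
}

// Only register when running inside Prince.
if (typeof Prince !== "undefined") {
  Prince.trackBoxes = true;
  Prince.registerPostLayoutFunc(tightenRunts);
}
```

As a worked example of the test: a last line that is 40pt wide in a 195pt wide paragraph fills about 20.5% of the width, which is below 25% and therefore counts as a runt.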

Avoiding runts with variable fonts

In the previous example, the letter-spacing was adjusted. In this example, we will adjust the font-stretch instead. Adjusting font-stretch means that the shape of the letters is changed. Only some select fonts support this. First, they must be of the variable type. Also, they must have a wdth axis. In this example, we will use the freely available Noto Serif font. To make this work, you will need a recent version of Prince.

Here is the document before we start runt-chasing:

<p>In total, the number of pages in the document was two. 
<p>In total, the number of pages in the document was three. 
<p>In total, the number of pages in the document was sixteen. 

And here is the script which avoids runts by setting font-stretch in affected paragraphs:

Prince.trackBoxes = true;
Prince.registerPostLayoutFunc(tighten);

function tighten(selector) {
   var elements = document.querySelectorAll("p");

   for(var i=0; i<elements.length; i++) {                    // loop through elements
      var thiselement = elements[i];
      var fboxes = elements[i].getPrinceBoxes();
      for (var j=0; j<fboxes.length; j++) {                  // loop through boxes
         var thisbox = fboxes[j];
         for (var k=0; k<thisbox.children.length; k++) {     // loop through lines
            var filling = (thisbox.children[k].w * 100) / thisbox.w;   
            if ((filling < 30) && (k == (thisbox.children.length - 1))) {       // if this is a runt
               var style = getComputedStyle(thiselement);
               var fsi = parseInt(style.getPropertyValue('font-stretch'));     // find current setting
               console.log("fsi ",fsi);
               fsi -=2;                                      // reduce stretch by 2%
               thiselement.style.background = "pink";
               thiselement.style.fontStretch = fsi+"%";
               Prince.registerPostLayoutFunc(tighten);  
            }
         }
      }
   }
}

The pink background indicates that the styling of the paragraph has been changed by the script.

Subtotals

It is common to calculate subtotals on invoices on a per-page basis. Here's a solution which shows subtotals in deferred elements:

<table>
<tr><td>Apples<td class=add>1
<tr><td>Oranges<td class=add>2
<tr><td>Pears<td class=add>1
...
</table>
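
The script for this example is not listed here. The core of it, adding up the cells that end up on each page, could be sketched as follows. The sketch only computes and logs the sums; placing them into the deferred elements is left out. The function names `subtotals` and `logSubtotals` are invented for this illustration.

```javascript
// Sum the numeric content of the given cells per page.
// Returns an object that maps page number to subtotal.
function subtotals(cells) {
  var sums = {};
  for (var i = 0; i < cells.length; i++) {
    var boxes = cells[i].getPrinceBoxes();
    var value = parseFloat(cells[i].textContent);
    if (boxes.length == 0 || isNaN(value)) continue;   // skip hidden or non-numeric cells
    var page = boxes[0].pageNum;                       // page where the cell starts
    sums[page] = (sums[page] || 0) + value;
  }
  return sums;
}

// Post-layout function: cells to be added are marked with class "add".
function logSubtotals() {
  var sums = subtotals(document.querySelectorAll("td.add"));
  for (var page in sums) {
    console.log("Subtotal on page " + page + ": " + sums[page]);
  }
}

// Only register when running inside Prince.
if (typeof Prince !== "undefined") {
  Prince.trackBoxes = true;
  Prince.registerPostLayoutFunc(logSubtotals);
}
```

A cell that is split across two pages is counted on the page where it starts, which seems the least surprising choice for an invoice line.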

2024-12-26 Håkon Wium Lie