XML

Overview

It is highly recommended to read the HTML section before reading this section. HTML and XML have a few similar concepts and terms that will make this section easier to understand.

XML stands for Extensible Markup Language. XML and HTML appear and sound very similar but have different goals. Whereas HTML was designed to display data, XML was designed to store data.

Like HTML, XML has start tags, content, and end tags. However, rather than using a finite set of "legal" or "official" tags, XML tags are defined and created by the user. What does this mean? It means the following is perfectly legal XML.

<exam>
  <question>
    <text>What is red?</text>
    <solution>A color.</solution>
    <points>2</points>
  </question>
  <question>
    <text>What is a square?</text>
    <solution>A shape.</solution>
    <points>1</points>
  </question>
</exam>

Rather than opening that content in a web browser and expecting it to display data, we would instead write custom software to send, receive, store, display, or do something with the data.
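
As one illustration of such custom software, the following is a minimal sketch in Python that reads the exam document with the standard library's xml.etree.ElementTree module and totals the points. The variable names (exam_xml, questions, total_points) are chosen here for illustration and are not part of the original example.

```python
import xml.etree.ElementTree as ET

# The exam document from above, stored as a string for this sketch.
exam_xml = """<exam>
  <question>
    <text>What is red?</text>
    <solution>A color.</solution>
    <points>2</points>
  </question>
  <question>
    <text>What is a square?</text>
    <solution>A shape.</solution>
    <points>1</points>
  </question>
</exam>"""

# Parse the string; root is the <exam> element.
root = ET.fromstring(exam_xml)

# Turn each <question> element into a plain dictionary.
questions = []
for question in root.findall("question"):
    questions.append({
        "text": question.findtext("text"),
        "solution": question.findtext("solution"),
        "points": int(question.findtext("points")),
    })

total_points = sum(q["points"] for q in questions)
print(total_points)  # 3
```

Once the data is in ordinary Python structures, it can be stored, displayed, or sent elsewhere like any other data.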

Specifications

The W3C provides the official specification for the Extensible Markup Language (XML). They also have a succinct description. The W3C also provides the official specifications for

XPath expressions

XPath expressions are used to programmatically select nodes or node-sets from an XML document.

The following table (from w3schools) shows some of the most useful expressions that can be used to select nodes programmatically.

Expression   Description
nodename     Selects all nodes with the name "nodename"
/            Selects from the root node
.//          Selects nodes that match the selection, but only among the descendants of the current node
.            Selects the current node
..           Selects the parent of the current node
@            Selects attributes
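
To make these expressions concrete, here is a small sketch using Python's standard xml.etree.ElementTree module. That module implements only a subset of XPath (for example, it does not accept absolute paths starting with "/" on an element, and attributes can only appear inside predicates), so treat this as an approximation rather than a full XPath engine. The "id" attributes in the sample document are hypothetical additions made so that the attribute syntax has something to match.

```python
import xml.etree.ElementTree as ET

# A shortened exam document; the "id" attributes are hypothetical additions.
exam_xml = """<exam>
  <question id="q1">
    <text>What is red?</text>
    <points>2</points>
  </question>
  <question id="q2">
    <text>What is a square?</text>
    <points>1</points>
  </question>
</exam>"""

root = ET.fromstring(exam_xml)

# "nodename": child elements of the current node named "question".
by_name = root.findall("question")

# ".": the current node itself.
current = root.find(".")

# ".//": matching nodes anywhere below the current node.
all_points = [p.text for p in root.findall(".//points")]

# "..": the parent of each selected node (here, the parents of <points>).
parents = root.findall(".//points/..")

# "@": ElementTree supports attributes only inside predicates.
with_id = root.findall(".//question[@id='q2']")

print(len(by_name))              # 2
print(current.tag)               # exam
print(all_points)                # ['2', '1']
print([p.tag for p in parents])  # ['question', 'question']
print(with_id[0].get("id"))      # q2
```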

Predicates

A predicate is a way to filter a node-set by evaluating the expression contained within a set of square brackets []. For example, let’s say you wanted to get all of the questions that have points > 1. Doing this using predicates is easy.

<exam>
  <question>
    <text>What is red?</text>
    <solution>A color.</solution>
    <points>2</points>
  </question>
  <question>
    <text>What is a square?</text>
    <solution>A shape.</solution>
    <points>1</points>
  </question>
</exam>

The XPath expression to fetch the questions that are worth more than 1 point is the following.

//question[points > 1]
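
The effect of this predicate can be sketched in Python. The standard xml.etree.ElementTree module does not evaluate numeric comparisons such as points > 1, so the sketch below selects every "question" element and applies the comparison in Python instead. A full XPath 1.0 engine (for instance, the one in the third-party lxml package) should be able to evaluate the expression directly. The variable names are illustrative.

```python
import xml.etree.ElementTree as ET

exam_xml = """<exam>
  <question>
    <text>What is red?</text>
    <solution>A color.</solution>
    <points>2</points>
  </question>
  <question>
    <text>What is a square?</text>
    <solution>A shape.</solution>
    <points>1</points>
  </question>
</exam>"""

root = ET.fromstring(exam_xml)

# Equivalent of //question[points > 1]: visit every <question> and keep
# those whose <points> child holds a number greater than 1.
selected = [
    q for q in root.iter("question")
    if float(q.findtext("points")) > 1
]
print([q.findtext("text") for q in selected])  # ['What is red?']

# Equality predicates, by contrast, are supported directly by ElementTree.
one_point = root.findall(".//question[points='1']")
print([q.findtext("text") for q in one_point])  # ['What is a square?']
```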

Operators

In order to make comparisons, we need operators. The following is a list of XPath operators from here.

Operator  Description                          Example
|         Union of two node-sets               //book | //cd
+         Addition                             6 + 4
-         Subtraction                          6 - 4
*         Multiplication                       6 * 4
div       Division                             8 div 4
=         Equal to (relation, not assignment)  price = 9.80
!=        Not equal                            price != 9.80
<         Less than                            price < 9.80
<=        Less than or equal to                price <= 9.80
>         Greater than                         price > 9.80
>=        Greater than or equal to             price >= 9.80
or        Logical or                           price = 9.80 or price = 9.70
and       Logical and                          price > 9.00 and price < 10.00
mod       Modulus (division remainder)         5 mod 2

Functions

XPath functions are an easy way to filter data with more control. You can find a list of functions here. Of particular use are the contains and translate functions. contains allows you to see if the first argument string contains the second argument string. It is particularly useful to see if a class or attribute contains some substring. translate allows you to, among other things, change the case (whether the text is capitalized or not) of some text prior to using the contains function.
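
A typical case-insensitive test combines the two functions, for example //span[contains(translate(@aria-label, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'retweets')]. To show what such an expression computes, the following sketch re-implements both functions in plain Python, following the XPath 1.0 rules. The helper names xpath_translate and xpath_contains are hypothetical and exist only for this illustration; they are not part of any library. The sample string is one of the "aria-label" values from the document in the Examples section.

```python
def xpath_translate(text: str, source: str, target: str) -> str:
    """Mimic XPath 1.0 translate(text, source, target)."""
    mapping = {}
    for index, char in enumerate(source):
        if char in mapping:
            continue  # the first occurrence of a character wins
        # Characters without a counterpart in target are removed.
        mapping[char] = target[index] if index < len(target) else ""
    return "".join(mapping.get(char, char) for char in text)


def xpath_contains(haystack: str, needle: str) -> bool:
    """Mimic XPath 1.0 contains(haystack, needle)."""
    return needle in haystack


UPPER = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
LOWER = "abcdefghijklmnopqrstuvwxyz"

label = "123 comments, 43 Retweets, 4000 likes"

# contains is case-sensitive, so "retweets" is not found as-is.
print(xpath_contains(label, "retweets"))  # False

# Lowercasing with translate first makes the test case-insensitive.
lowered = xpath_translate(label, UPPER, LOWER)
print(xpath_contains(lowered, "retweets"))  # True

# translate can also delete characters: "-" has no counterpart in "".
print(xpath_translate("2024-01-15", "-", ""))  # 20240115
```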

Examples

While the following examples are useful, to see examples using XPath expressions in R or Python, please check out the following links.

The following are examples using the following XML document. You can use this tool to test your XPath expressions.

<html>
    <head>
        <title>My Title</title>
    </head>
    <body>
        <div>
            <div class="abc123 sktoe-sldjkt dkjfg3-dlgsk">
                <div class="glkjr-slkd dkgj-0 dklfgj-00">
                    <a class="slkdg43lk dlks" href="https://example.com/123456">
                    </a>
                </div>
            </div>
            <div>
                <div class="ldskfg4">
                    <span class="slktjoe" aria-label="123 comments, 43 Retweets, 4000 likes">Love it.</span>
                </div>
            </div>
            <div data-amount="12">13</div>
        </div>
        <div>
            <div class="abc123 sktoe-sls dkjfg-dlgsk">
                <div class="glkj-slkd dkgj-0 dklfj-00">
                    <a class="slkd3lk dls" href="https://example.com/123456">
                    </a>
                </div>
            </div>
            <div>
                <div class="ldg4">
                    <span class="sktjoe" aria-label="1000 comments, 455 Retweets, 40000 likes">Love it.</span>
                </div>
            </div>
            <div data-amount="122">133</div>
        </div>
    </body>
</html>

Write an XPath expression to get the "title" element.

Solution
//title

Write an XPath expression to get the content of the "title" element.

Solution
//title/text()

Write an XPath expression to get every "div" element in the document.

Solution
//div

Write an XPath expression to get every "div" element in the document with the "class" attribute having a value of "ldskfg4".

Solution
//div[@class="ldskfg4"]

Write an XPath expression to get every "div" element where the string "abc123" is in the "class" attribute’s value (as a substring).

Solution
//div[contains(@class, 'abc123')]

Write an XPath expression to get every "div" element with an "aria-label" attribute.

Solution
//div[@aria-label]
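
As a rough check of these solutions, the sketch below runs close equivalents with Python's standard xml.etree.ElementTree module against a trimmed copy of the document (only the first of the two outer "div" blocks). ElementTree supports only a subset of XPath, so paths start with ".//" rather than "//", text is read through the .text attribute instead of text(), and the contains test is applied in Python. These substitutions are assumptions of this sketch, not part of the original solutions.

```python
import xml.etree.ElementTree as ET

# A trimmed copy of the example document: the first outer "div" block only.
page = """<html>
    <head>
        <title>My Title</title>
    </head>
    <body>
        <div>
            <div class="abc123 sktoe-sldjkt dkjfg3-dlgsk">
                <div class="glkjr-slkd dkgj-0 dklfgj-00">
                    <a class="slkdg43lk dlks" href="https://example.com/123456">
                    </a>
                </div>
            </div>
            <div>
                <div class="ldskfg4">
                    <span class="slktjoe" aria-label="123 comments, 43 Retweets, 4000 likes">Love it.</span>
                </div>
            </div>
            <div data-amount="12">13</div>
        </div>
    </body>
</html>"""

root = ET.fromstring(page)

# //title and //title/text()
title = root.find(".//title")
print(title.text)  # My Title

# //div
divs = root.findall(".//div")
print(len(divs))  # 6

# //div[@class="ldskfg4"]
exact = root.findall(".//div[@class='ldskfg4']")
print(len(exact))  # 1

# //div[contains(@class, 'abc123')], with the substring test done in Python
abc = [d for d in root.iter("div") if "abc123" in d.get("class", "")]
print(len(abc))  # 1

# //div[@aria-label] matches nothing in this document, because the
# "aria-label" attributes sit on "span" elements.
print(len(root.findall(".//div[@aria-label]")))   # 0
print(len(root.findall(".//span[@aria-label]")))  # 1
```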

Resources

A w3schools tutorial on XPath expressions. A nice 1 page summary.

A simple, but very thorough cheatsheet to recall XPath expressions.

A great, online tool for testing XPath expressions.