XML
Overview
It is highly recommended to read the HTML section before reading this section. HTML and XML share several concepts and terms that will make this section easier to understand.
XML stands for Extensible Markup Language. XML and HTML appear and sound very similar but have different goals. Whereas HTML was designed to display data, XML was designed to store data.
Like HTML, XML has start tags, content, and end tags. However, rather than using a finite set of "legal" or "official" tags, XML tags are defined and created by the user. What does this mean? It means the following is perfectly legal XML.
<exam>
  <question>
    <text>What is red?</text>
    <solution>A color.</solution>
    <points>2</points>
  </question>
  <question>
    <text>What is a square?</text>
    <solution>A shape.</solution>
    <points>1</points>
  </question>
</exam>
Rather than opening that content in a web browser and expecting it to display data, we would instead write custom software to send, receive, store, display, or do something with the data.
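As a minimal sketch of what such custom software might look like, the following Python snippet reads the exam document with the standard library's xml.etree.ElementTree module and totals the points. The helper name load_questions and the dictionary layout are assumptions made for illustration, not part of any standard.

```python
import xml.etree.ElementTree as ET

EXAM_XML = """
<exam>
  <question>
    <text>What is red?</text>
    <solution>A color.</solution>
    <points>2</points>
  </question>
  <question>
    <text>What is a square?</text>
    <solution>A shape.</solution>
    <points>1</points>
  </question>
</exam>
"""


def load_questions(xml_text):
    """Return one dict per <question> element (hypothetical helper)."""
    root = ET.fromstring(xml_text.strip())
    questions = []
    for q in root.findall("question"):
        questions.append({
            "text": q.findtext("text"),
            "solution": q.findtext("solution"),
            "points": int(q.findtext("points")),  # stored as text in the XML
        })
    return questions


questions = load_questions(EXAM_XML)
total_points = sum(q["points"] for q in questions)

for q in questions:
    print(f'{q["text"]} ({q["points"]} points)')
print("Total points:", total_points)
```

The XML itself carries no display rules. The program decides what each tag means and what to do with its content.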
Specifications
The W3C provides the official specification for the Extensible Markup Language (XML). They also have a succinct description. The W3C also provides the official specifications for
XPath expressions
XPath expressions are used to programmatically select nodes or node-sets from an XML document.
The following table (from w3schools) shows some of the most useful expressions that can be used to select nodes programmatically.
Expression | Description |
---|---|
nodename | Selects all nodes with the name "nodename" |
/ | Selects from the root node |
.// | Selects nodes that match the selection only if they are descendants of the current node |
. | Selects the current node |
.. | Selects the parent of the current node |
@ | Selects attributes |
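The expressions in the table can be tried out with Python's standard xml.etree.ElementTree module, which supports a limited subset of XPath. The sketch below is illustrative only: the id attributes on the question elements are assumptions added so that the @ expression has something to match. Note that ElementTree does not accept absolute paths (a leading /) when searching from an element, so the searches start from the current node.

```python
import xml.etree.ElementTree as ET

# The id attributes are hypothetical, added only to demonstrate "@".
root = ET.fromstring("""<exam>
  <question id="q1"><text>What is red?</text><points>2</points></question>
  <question id="q2"><text>What is a square?</text><points>1</points></question>
</exam>""")

# "." selects the current node (here, the <exam> root itself).
current = root.findall(".")

# "nodename" selects the children of the current node with that name.
children = root.findall("question")

# ".//" selects matching nodes anywhere below the current node.
texts = [t.text for t in root.findall(".//text")]

# ".." selects the parent of each matched node.
parents = root.findall(".//points/..")

# "@" refers to an attribute; ElementTree uses it inside a predicate.
second = root.findall("question[@id='q2']")

print([c.tag for c in current])
print(len(children), texts)
print([p.get("id") for p in parents])
print(second[0].findtext("text"))
```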
Predicates
A predicate is a way to filter a node-set by evaluating the expression contained within a set of square brackets []. For example, let's say you wanted to get all of the questions that have points > 1. Doing this using predicates is easy.
<exam>
  <question>
    <text>What is red?</text>
    <solution>A color.</solution>
    <points>2</points>
  </question>
  <question>
    <text>What is a square?</text>
    <solution>A shape.</solution>
    <points>1</points>
  </question>
</exam>
The XPath expression to fetch the questions that are worth more than 1 point is the following.
//question[points > 1]
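To see the effect of this predicate from code, here is a hedged Python sketch. The standard library's xml.etree.ElementTree supports only a subset of XPath: equality predicates such as [points='2'] work, but comparison operators such as > do not. The sketch therefore selects every question and applies the comparison in Python, which gives the same result as the expression above for this document. A library with full XPath 1.0 support (lxml is one such third-party option) could evaluate the expression directly.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("""<exam>
  <question>
    <text>What is red?</text>
    <solution>A color.</solution>
    <points>2</points>
  </question>
  <question>
    <text>What is a square?</text>
    <solution>A shape.</solution>
    <points>1</points>
  </question>
</exam>""")

# An equality predicate is supported directly by ElementTree.
worth_two = root.findall(".//question[points='2']")

# Emulates //question[points > 1]: select all questions, then filter.
high_value = [
    q for q in root.findall(".//question")
    if float(q.findtext("points")) > 1
]

for q in high_value:
    print(q.findtext("text"), q.findtext("points"))
```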
Operators
In order to make comparisons, we need operators. The following is a list of XPath operators from here.
Operator | Description | Example |
---|---|---|
\| | Computes the union of two node-sets | //book \| //cd |
+ | Addition | 6 + 4 |
- | Subtraction | 6 - 4 |
* | Multiplication | 6 * 4 |
div | Division | 8 div 4 |
= | Equal to (relation, not assignment) | price = 9.80 |
!= | Not equal | price != 9.80 |
< | Less than | price < 9.80 |
<= | Less than or equal to | price <= 9.80 |
> | Greater than | price > 9.80 |
>= | Greater than or equal to | price >= 9.80 |
or | Logical or | price = 9.80 or price = 9.70 |
and | Logical and | price > 9.00 and price < 10.00 |
mod | Modulus (division remainder) | 5 mod 2 |
Functions
XPath functions are an easy way to filter data with more control. You can find a list of functions here. Of particular use are the contains and translate functions. contains tests whether its first argument string contains its second argument string. It is particularly useful for checking whether a class or other attribute contains some substring. translate allows you to, among other things, change the case (whether the text is capitalized or not) of some text prior to using the contains function.
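The following Python sketch imitates what a combined contains and translate test computes. It is an emulation, not an XPath evaluation: the standard library's xml.etree.ElementTree does not implement XPath functions, so the lowercase mapping and the substring test are done in Python. The sample document and its class values are made up for illustration.

```python
import xml.etree.ElementTree as ET

# XPath 1.0 expression being imitated:
#   //div[contains(translate(@class, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ',
#                                    'abcdefghijklmnopqrstuvwxyz'), 'abc123')]
UPPER = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
LOWER = "abcdefghijklmnopqrstuvwxyz"
TO_LOWER = str.maketrans(UPPER, LOWER)  # plays the role of translate()

# Hypothetical sample data with mixed-case class values.
root = ET.fromstring("""<body>
  <div class="ABC123 sktoe-sldjkt">first</div>
  <div class="xyz789">second</div>
  <div class="prefix-abc123-suffix">third</div>
</body>""")

matches = [
    div for div in root.findall(".//div")
    # Lowercase the attribute, then test for the substring (contains()).
    if "abc123" in div.get("class", "").translate(TO_LOWER)
]

print([div.text for div in matches])
```

Lowercasing the attribute first makes the substring test case-insensitive, which is the usual reason for pairing translate with contains.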
Examples
While the following examples are useful, to see examples using XPath expressions in R or Python, please check out the following links.
The following are examples using the following XML document. You can use this tool to test your XPath expressions.
<html>
  <head>
    <title>My Title</title>
  </head>
  <body>
    <div>
      <div class="abc123 sktoe-sldjkt dkjfg3-dlgsk">
        <div class="glkjr-slkd dkgj-0 dklfgj-00">
          <a class="slkdg43lk dlks" href="https://example.com/123456">
          </a>
        </div>
      </div>
      <div>
        <div class="ldskfg4">
          <span class="slktjoe" aria-label="123 comments, 43 Retweets, 4000 likes">Love it.</span>
        </div>
      </div>
      <div data-amount="12">13</div>
    </div>
    <div>
      <div class="abc123 sktoe-sls dkjfg-dlgsk">
        <div class="glkj-slkd dkgj-0 dklfj-00">
          <a class="slkd3lk dls" href="https://example.com/123456">
          </a>
        </div>
      </div>
      <div>
        <div class="ldg4">
          <span class="sktjoe" aria-label="1000 comments, 455 Retweets, 40000 likes">Love it.</span>
        </div>
      </div>
      <div data-amount="122">133</div>
    </div>
  </body>
</html>
Write an XPath expression to get every "div" element in the document with the "class" attribute having a value of "ldskfg4".
//div[@class="ldskfg4"]
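One way to check this expression from code is sketched below with Python's standard xml.etree.ElementTree module. To keep the snippet short, it uses a trimmed copy of the example document (an assumption for brevity, with the link blocks omitted). ElementTree does not accept a leading // when searching from an element, so the equivalent relative form .// is used.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the example document; the <a> blocks are omitted.
root = ET.fromstring("""<html>
  <body>
    <div>
      <div>
        <div class="ldskfg4">
          <span class="slktjoe" aria-label="123 comments, 43 Retweets, 4000 likes">Love it.</span>
        </div>
      </div>
      <div data-amount="12">13</div>
    </div>
    <div>
      <div>
        <div class="ldg4">
          <span class="sktjoe" aria-label="1000 comments, 455 Retweets, 40000 likes">Love it.</span>
        </div>
      </div>
      <div data-amount="122">133</div>
    </div>
  </body>
</html>""")

# Every <div> whose class attribute is exactly "ldskfg4".
matches = root.findall(".//div[@class='ldskfg4']")

# Read an attribute from the <span> inside each matching <div>.
labels = [m.find("span").get("aria-label") for m in matches]
print(len(matches), labels)
```

The predicate compares the whole attribute value, so the div with class "ldg4" is not selected.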