PHP: Parsing HTML files with DOMDocument and DOMXpath
The DOMDocument PHP class allows us to take an HTML file or HTML text input and convert it into an object that can be easily traversed and queried similar to the way things are done in JavaScript.
1. Sample input
For the following examples we're working with a text input imported using the loadHTML() method, but you can just as easily import a local or remote HTML file using loadHTMLFile() instead.
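As a rough sketch of the file-based alternative, the snippet below writes a small, made-up HTML page to a temporary file and reads it back with loadHTMLFile(). The file contents and variable names are assumptions for illustration only, and libxml_use_internal_errors() is used so that markup warnings are collected rather than printed.

```php
<?php
// Hypothetical markup saved to a temporary file so the example is self-contained.
$path = tempnam(sys_get_temp_dir(), 'dom');
file_put_contents(
    $path,
    '<html><body><h1>Sample page</h1><p><a href="a.html">A link</a></p></body></html>'
);

// Collect parser warnings internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$loaded = $doc->loadHTMLFile($path); // a remote URL can be passed the same way
libxml_clear_errors();
unlink($path);

// From here on the document is queried exactly like one built with loadHTML().
$heading = $doc->getElementsByTagName('h1')->item(0)->nodeValue;
echo $heading, "\n";
```

Once loaded, the document behaves the same whichever method was used, so the remaining examples apply unchanged.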
The HTML is as follows. We're aiming to extract the links and text just from the H2 elements inside the .blogArticle sections of the page (the four article title links) and ignore all other links:
<?PHP
$htmlinput = <<<EOT
<a href="#content">skip to content</a>
<div id="content">
<h1>H1 Heading</h1>
<p>Introductory text <a
href="intro-link1.html">link1</a> and <a
href="intro-link2.html">link2</a>.</p>
<div class="blogArticle">
<h2><a href="article1.html">Article #1 Title</a></h2>
<p>Introductory text ... <a href="article1.html">more »</a></p>
</div>
<a href="#top">Top</a>
<div class="blogArticle">
<h2><a href="article2.html">Article #2 Title</a></h2>
<p>Introductory text ... <a href="article2.html">more »</a></p>
</div>
<a href="#top">Top</a>
<div class="blogArticle">
<h2><a href="article3.html">Article #3 Title</a></h2>
<p>Introductory text ... <a href="article3.html">more »</a></p>
</div>
<a href="#top">Top</a>
<div class="blogArticle">
<h2><a href="article4.html">Article #4 Title</a></h2>
<p>Introductory text ... <a href="article4.html">more »</a></p>
</div>
<a href="#top">Top</a>
<p>Footer text <a href="footer-link.html">link</a>.</p>
</div>
<p><a href="copyright.html">Copyright © 2014</a></p>
EOT;
?>
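This particular task could also be handled with regular expressions, but in more complicated situations the DOM approach has clear advantages: it copes with nesting, attribute order and inconsistent whitespace that are awkward to express in a pattern.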
2. Finding all links in the document
To find and extract all links from an HTML document we use the getElementsByTagName method, which will be familiar from JavaScript:
<?PHP
$doc = new DOMDocument();
$doc->loadHTML($htmlinput);
// all links in document
$links = array();
$arr = $doc->getElementsByTagName("a"); // DOMNodeList Object
foreach($arr as $item) { // DOMElement Object
    $href = $item->getAttribute("href");
    $text = trim(preg_replace("/[\r\n]+/", " ", $item->nodeValue));
    $links[] = array(
        'href' => $href,
        'text' => $text
    );
}
?>
In this case all 17 links in the HTML are returned.
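A quick way to check a count like this is the length property of the returned DOMNodeList. The following is a minimal sketch using a shortened, made-up input with three links rather than the full sample; the variable names are illustrative only.

```php
<?php
// Shortened, hypothetical input with three links.
$html = '<p><a href="one.html">One</a> <a href="two.html">Two</a></p>'
      . '<p><a href="three.html">Three</a></p>';

$doc = new DOMDocument();
$doc->loadHTML($html);

$anchors = $doc->getElementsByTagName('a'); // DOMNodeList
$count = $anchors->length;                  // number of matched elements

// item() gives indexed access, much like a NodeList in JavaScript.
$firstHref = $anchors->item(0)->getAttribute('href');
$lastHref = $anchors->item($count - 1)->getAttribute('href');

echo "$count links, from $firstHref to $lastHref\n";
```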
A slight improvement is to identify a containing element, in this case #content, and restrict the search to it using the getElementById method, which also mirrors its JavaScript counterpart:
<?PHP
$doc = new DOMDocument();
$doc->loadHTML($htmlinput);
// all links in #content
$links = array();
$container = $doc->getElementById("content");
$arr = $container->getElementsByTagName("a");
foreach($arr as $item) {
    $href = $item->getAttribute("href");
    $text = trim(preg_replace("/[\r\n]+/", " ", $item->nodeValue));
    $links[] = array(
        'href' => $href,
        'text' => $text
    );
}
?>
This now excludes any links outside of the #content container, leaving us with 15 links.
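One thing to watch for is that getElementById returns null when no element has the requested id, so calling getElementsByTagName on the result would fail. A minimal sketch of a guard follows; the markup and the "main" id are made up for the example.

```php
<?php
// Hypothetical input that has no element with id="content".
$html = '<div id="main"><a href="page.html">Page</a></div>';

$doc = new DOMDocument();
$doc->loadHTML($html);

$links = array();
$container = $doc->getElementById('content'); // null when the id is not found
if ($container !== null) {
    foreach ($container->getElementsByTagName('a') as $item) {
        $links[] = $item->getAttribute('href');
    }
}

// The same check applied to an id that does exist in this sample.
$fallback = $doc->getElementById('main');
$fallbackCount = $fallback === null
    ? 0
    : $fallback->getElementsByTagName('a')->length;

echo count($links), " links in #content, $fallbackCount in #main\n";
```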
3. getElementsByClassName equivalent
There is no actual getElementsByClassName (yet) in DOMDocument, but the same results can be produced using DOMXpath as follows:
<?PHP
$doc = new DOMDocument();
$doc->loadHTML($htmlinput);
$xpath = new DOMXpath($doc);
$articles = $xpath->query('//div[@class="blogArticle"]');
// all links in .blogArticle
$links = array();
foreach($articles as $container) {
    $arr = $container->getElementsByTagName("a");
    foreach($arr as $item) {
        $href = $item->getAttribute("href");
        $text = trim(preg_replace("/[\r\n]+/", " ", $item->nodeValue));
        $links[] = array(
            'href' => $href,
            'text' => $text
        );
    }
}
?>
Whereas in the previous example we searched for links in #content - a single element - we're now searching for links within multiple .blogArticle sections of the page.
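The most complicated element here is the DOMXpath query //div[@class="blogArticle"], which targets all DIV elements whose class attribute is exactly blogArticle. Because the comparison is against the whole attribute value, an element with several class names (for example class="blogArticle featured") will not match, so in those cases the query will need refining.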
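One common refinement, sketched below, is to pad the class list with spaces and search for the class name as a whole token. The markup here is hypothetical and only serves to contrast the two queries.

```php
<?php
// Hypothetical markup: one exact class, one with an extra class, one similar name.
$html = '<div class="blogArticle">A</div>'
      . '<div class="blogArticle featured">B</div>'
      . '<div class="blogArticleFooter">C</div>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Exact comparison: only matches when the whole attribute is "blogArticle".
$exact = $xpath->query('//div[@class="blogArticle"]');

// Token comparison: pads the class list with spaces and looks for " blogArticle ".
$token = $xpath->query(
    '//div[contains(concat(" ", normalize-space(@class), " "), " blogArticle ")]'
);

echo $exact->length, " exact, ", $token->length, " by token\n";
```

The padding is what stops a similar name such as blogArticleFooter from matching, which a plain contains(@class, "blogArticle") would not do.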
When making DOMXpath queries within another element, start the query string with .// and pass the container node as the second argument. For example:
$xpath->query('.//div[@class="post-details"]', $container);
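To illustrate why the leading dot matters, here is a small sketch with made-up markup in which a post-details element appears both inside and outside the container.

```php
<?php
// Hypothetical markup: "post-details" appears inside and outside the article.
$html = '<div class="blogArticle"><div class="post-details">inside</div></div>'
      . '<div class="post-details">outside</div>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$container = $xpath->query('//div[@class="blogArticle"]')->item(0);

// Relative query: the leading dot anchors the search at $container.
$relative = $xpath->query('.//div[@class="post-details"]', $container);

// Without the dot, // starts from the document root even with a context node.
$absolute = $xpath->query('//div[@class="post-details"]', $container);

echo $relative->length, " relative, ", $absolute->length, " absolute\n";
```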
4. The final step
Now we need to single out just the links having an H2 as their parent:
<?PHP
$doc = new DOMDocument();
$doc->loadHTML($htmlinput);
$xpath = new DOMXpath($doc);
$articles = $xpath->query('//div[@class="blogArticle"]');
// all links in h2's in .blogArticle
$links = array();
foreach($articles as $container) {
    $arr = $container->getElementsByTagName("a");
    foreach($arr as $item) {
        if($item->parentNode->tagName == "h2") {
            $href = $item->getAttribute("href");
            $text = trim(preg_replace("/[\r\n]+/", " ", $item->nodeValue));
            $links[] = array(
                'href' => $href,
                'text' => $text
            );
        }
    }
}
?>
Finally, the result we're after. The $links array now contains just four links, matching the four article headings in the input HTML:
Array
(
    [0] => Array
        (
            [href] => article1.html
            [text] => Article #1 Title
        )

    [1] => Array
        (
            [href] => article2.html
            [text] => Article #2 Title
        )

    [2] => Array
        (
            [href] => article3.html
            [text] => Article #3 Title
        )

    [3] => Array
        (
            [href] => article4.html
            [text] => Article #4 Title
        )

)
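As an alternative, the container lookup and the parent check can be folded into a single DOMXpath query. This is only a sketch of that variation, run against a shortened, hypothetical version of the sample input; it is not the approach used in the code above.

```php
<?php
// Shortened, hypothetical version of the sample input.
$html = '<div id="content">'
      . '<div class="blogArticle"><h2><a href="article1.html">Article #1 Title</a></h2>'
      . '<p>Intro <a href="article1.html">more</a></p></div>'
      . '<a href="#top">Top</a>'
      . '<div class="blogArticle"><h2><a href="article2.html">Article #2 Title</a></h2>'
      . '<p>Intro <a href="article2.html">more</a></p></div>'
      . '</div>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Anchors whose parent is an H2 somewhere inside a .blogArticle DIV.
$links = array();
foreach ($xpath->query('//div[@class="blogArticle"]//h2/a') as $item) {
    $links[] = array(
        'href' => $item->getAttribute('href'),
        'text' => trim($item->nodeValue)
    );
}
print_r($links);
```

The nested-loop version is easier to step through, while the single query keeps all the selection logic in one place.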
An identical approach can be used to find images in HTML: search for the IMG tag name and use getAttribute to extract the SRC and other attributes.
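A minimal sketch of the image variant might look like this. The markup and the choice of the alt attribute are assumptions for the example.

```php
<?php
// Hypothetical markup with two images; the second has no alt attribute.
$html = '<p><img src="photo1.jpg" alt="First photo" width="200">'
      . '<img src="photo2.png"></p>';

$doc = new DOMDocument();
$doc->loadHTML($html);

$images = array();
foreach ($doc->getElementsByTagName('img') as $item) {
    $images[] = array(
        'src' => $item->getAttribute('src'),
        'alt' => $item->getAttribute('alt') // empty string when the attribute is missing
    );
}
print_r($images);
```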
If you're planning to use this code to spider websites, you should also read our related article on reading and obeying robots.txt.
5. References
PHP: The DOMDocument class
PHP: The DOMXPath class
6. Related Articles - Parsing files
Parsing HTML to find Links [PHP]
Parsing HTML files with DOMDocument and DOMXpath [PHP]
Parsing robots.txt [PHP]