SmartDOMDocument – A Smarter PHP DOMDocument Class

What Is SmartDOMDocument?
What Is DOMDocument?
So What Exactly Does SmartDOMDocument Do Then?
saveHTMLExact()
Encoding Fix
SmartDOMDocument Object As String
Example
Requirements And Prerequisites
Sounds Great – Where Do I Get It?
Download
Check out from SVN
Use as "svn:externals"
Git
Version History
References
How To Report Bugs

What Is SmartDOMDocument?

SmartDOMDocument is an enhanced version of PHP's built-in DOMDocument class.
SmartDOMDocument inherits from DOMDocument, so it's very easy to use – just declare an object of type SmartDOMDocument instead of DOMDocument and enjoy the new behavior on top of all existing functionality (see example below).

What Is DOMDocument?

DOMDocument is a native PHP library for using DOM to read, parse, manipulate, and write HTML and XML.
Instead of using hacky regexes that are prone to breaking as soon as something you haven't thought of changes, DOMDocument parses HTML/XML using the DOM (Document Object Model), just like your browser, and creates an easily manipulatable object in memory.
DOMDocument can actually validate and normalize your HTML/XML.
DOMDocument supports namespaces.
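
As a rough illustration of the points above, here is a minimal sketch that pulls link targets out of a snippet with plain DOMDocument instead of a regex. The sample markup and variable names are made up for this example.

```php
<?php
// Sample markup invented for this illustration.
$html = '<ul><li><a href="/one">One</a></li><li><a href="/two">Two</a></li></ul>';

$doc = new DOMDocument();
$doc->loadHTML($html);

// Walk every <a> element in document order and collect its href attribute.
$hrefs = array();
foreach ($doc->getElementsByTagName('a') as $link) {
    $hrefs[] = $link->getAttribute('href');
}

print implode(', ', $hrefs) . "\n";
```

Because the lookup works on the parsed tree, it should keep working if attributes are reordered or extra whitespace appears in the markup, which is where regexes tend to break.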

So What Exactly Does SmartDOMDocument Do Then?

DOMDocument by itself is good but has a few annoyances, which SmartDOMDocument tries to correct. Here are some things it does:

saveHTMLExact()

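DOMDocument has an extremely badly designed "feature" where if the HTML code you are loading does not contain <html> and <body> tags, it adds them automatically (yup, when this was written there were no flags to turn this behavior off; PHP 5.4 later added the LIBXML_HTML_NOIMPLIED and LIBXML_HTML_NODEFDTD options to loadHTML(), provided libxml is recent enough).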

Thus, when you call $doc->saveHTML(), your newly saved content now has <html><body> and DOCTYPE in it. Not very handy when trying to work with code fragments (XML has a similar problem).

SmartDOMDocument contains a new function called saveHTMLExact() which does exactly what you would want – it saves HTML without adding that extra garbage that DOMDocument does.
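
To make the problem concrete, here is a small sketch using plain DOMDocument. It shows the wrapper elements appearing in saveHTML() output, followed by one possible way to get the fragment back by serializing only the children of <body>. This is an assumption about how such a workaround could look, not SmartDOMDocument's actual implementation, and the sample fragment is made up.

```php
<?php
// Illustration with plain DOMDocument; this is not SmartDOMDocument's code.
$fragment = '<div class="class1"><p>Some Text</p></div>';

$doc = new DOMDocument();
$doc->loadHTML($fragment);

// saveHTML() with no argument serializes the whole document, including
// the wrapper elements the parser added while loading the fragment.
$full = $doc->saveHTML();

// One possible workaround: serialize only the children of <body>.
// Passing a node to saveHTML() requires PHP 5.3.6 or newer.
$exact = '';
$body = $doc->getElementsByTagName('body')->item(0);
if ($body) {
    foreach ($body->childNodes as $child) {
        $exact .= $doc->saveHTML($child);
    }
}

print "Full document:\n" . $full . "\n";
print "Fragment only:\n" . $exact . "\n";
```

With SmartDOMDocument itself, a single call to saveHTMLExact() is meant to replace the loop above.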

Encoding Fix

DOMDocument notoriously doesn't handle encoding (at least UTF-8) correctly and garbles the output.

SmartDOMDocument tries to work around this problem by enhancing loadHTML() to deal with encoding correctly. This behavior is transparent to you – just use loadHTML() as you would normally.
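
For readers curious what such a workaround can look like, here is a sketch of one common trick with plain DOMDocument: prepending a charset declaration so the parser reads the input as UTF-8. This is an assumption for illustration only; the method SmartDOMDocument actually uses may differ, and the sample string is made up.

```php
<?php
// Illustration with plain DOMDocument; the prefix trick below is one
// common workaround, not necessarily what SmartDOMDocument does.

// "café" with the é written as its two UTF-8 bytes, so the example does
// not depend on the encoding this file is saved in.
$html = "<p>caf\xC3\xA9</p>";

// Declare the charset up front so the HTML parser decodes the bytes as UTF-8.
$prefix = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">';

$doc = new DOMDocument();
$doc->loadHTML($prefix . $html);

// textContent is returned as UTF-8.
$text = $doc->getElementsByTagName('p')->item(0)->textContent;

print $text . "\n";
```

A side effect of this particular trick is that the injected <meta> tag ends up in the parsed document, so a wrapper class would probably want to remove it again after loading.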

SmartDOMDocument Object As String

You can use a SmartDOMDocument object as a string which will print out its contents.

For example:

echo "Here is the HTML: $smart_dom_doc";
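
String conversion like this is typically done with PHP's __toString() magic method. The sketch below shows the idea on a hypothetical subclass; the class name StringableDOMDocument and the method body are assumptions, not SmartDOMDocument's source.

```php
<?php
// Hypothetical subclass for illustration; the name and body are assumptions,
// not SmartDOMDocument's actual source.
class StringableDOMDocument extends DOMDocument
{
    // PHP calls this whenever the object is used in a string context.
    public function __toString()
    {
        return $this->saveHTML();
    }
}

$doc = new StringableDOMDocument();
$doc->loadHTML('<p>Hello</p>');

// Interpolating the object triggers __toString().
$message = "Here is the HTML: $doc";
print $message;
```

This sketch returns the full saveHTML() output, wrapper tags included; a class that also strips the wrappers could return its fragment serialization instead.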

Example

This example loads sample HTML using SmartDOMDocument, uses getElementsByTagName() to find the first <img> tag and removeChild() to remove it, then prints the HTML before and after the removal along with the HTML of the removed image.

    $content = <<<CONTENT
<div class='class1'>
  <img src='http://www.google.com/favicon.ico' />
  Some Text
  <p>???????</p>
</div>
CONTENT;

    print "Before removing the image, the content is: " . htmlspecialchars($content) . "<br>";

    $content_doc = new SmartDOMDocument();
    $content_doc->loadHTML($content);

    // Default to an empty string so the final print works even if no <img> is found.
    $image = '';

    try {
      $first_image = $content_doc->getElementsByTagName("img")->item(0);

      if ($first_image) {
        // Detach the image from the document and re-serialize what is left.
        $first_image->parentNode->removeChild($first_image);

        $content = $content_doc->saveHTMLExact();

        // Import the detached node into a fresh document to serialize it on its own.
        $image_doc = new SmartDOMDocument();
        $image_doc->appendChild($image_doc->importNode($first_image, true));
        $image = $image_doc->saveHTMLExact();
      }
    } catch(Exception $e) {
      // Errors are deliberately ignored in this example.
    }

    print "After removing the image, the content is: " . htmlspecialchars($content) . "<br>";
    print "The image is: " . htmlspecialchars($image);

Requirements And Prerequisites

PHP 5.2+ is no longer a requirement. Any version of PHP 5 that has DOMDocument should work now.
DOMDocument – this should be a built-in class but I've seen instances of it missing for some reason. My guess is 99.9% you will already have it.

Sounds Great – Where Do I Get It?

Download

http://svn.beerpla.net/repos/public/PHP/SmartDOMDocument/trunk/SmartDOMDocument.class.php

Check out from SVN

svn co http://svn.beerpla.net/repos/public/PHP/SmartDOMDocument/trunk SmartDOMDocument

I highly recommend using SVN (Subversion) because you can easily update to the latest version by running svn up.

Use as "svn:externals"

If you have an existing project in SVN and you would like to use SmartDOMDocument, you can set up this library as svn:externals.

svn:externals is kind of like a symlink to another repository from your existing SVN project. That way, you can still benefit from using SVN commands such as svn up without having to maintain a local copy of the external code.

You can read more about setting svn:externals here.

Here's how you would do this:

cd YOUR_PROJ_DIR;
svn propset svn:externals 'SmartDOMDocument http://svn.beerpla.net/repos/public/PHP/SmartDOMDocument/trunk' .
svn ci .
svn up

Git

Update 9/11/2015: I have moved the code from svn to git (Bitbucket). You can now find it here and if you have contributions, feel free to send a pull request.

Version History

0.4.1

added return value to loadHTML() (thanks, Grey)

0.4

No longer using C14N() because it is causing bugs (such as <br> turning into <br></br>)
PHP 5.2+ is no longer a requirement, due to the change above.

0.3.2

test/example function added

0.3.1

suppress warnings when loading HTML by default (this may change to use a setting later). This gets rid of "empty content", "unexpected tag", and other not well formed HTML warnings.
add the standard trunk/tags/branches layout to the SVN repository.

0.3

use a better, more portable method of dealing with encodings properly (thanks piopier).

0.2

use the undocumented DOMDocument->C14N() if it's available (PHP 5.2+) to save exact HTML (really, PHP? We don't document extremely useful functions anymore?).

0.1

initial release.

References

PHP.net DOMDocument class reference

How To Report Bugs

You have a few options here:

Leave a comment here.
Send a pull request.

Artem Russakovskii's programming and technology blog
