What Is SmartDOMDocument?

  • SmartDOMDocument is an enhanced version of PHP's built-in DOMDocument class.
  • SmartDOMDocument inherits from DOMDocument, so it's very easy to use – just declare an object of type SmartDOMDocument instead of DOMDocument and enjoy the new behavior on top of all existing functionality (see example below).

 

What Is DOMDocument?

  • DOMDocument is a native PHP class (part of the DOM extension) for reading, parsing, manipulating, and writing HTML and XML through the DOM.
  • Instead of using hacky regexes that are prone to breaking as soon as something you haven't thought of changes, DOMDocument parses HTML/XML using the DOM (Document Object Model), just like your browser, and creates an easily manipulatable object in memory.
  • DOMDocument can actually validate and normalize your HTML/XML.
  • DOMDocument supports namespaces.
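
To make the points above concrete, here is a minimal sketch that uses plain DOMDocument to parse a snippet and pull out link targets without any regexes. The sample markup and variable names are made up for illustration.

```php
<?php
// Parse a small HTML snippet and read data from it through the DOM.
$html = '<ul><li><a href="/one">One</a></li><li><a href="/two">Two</a></li></ul>';

$doc = new DOMDocument();
$doc->loadHTML($html);

// getElementsByTagName() returns a DOMNodeList that can be iterated directly.
$hrefs = [];
foreach ($doc->getElementsByTagName('a') as $link) {
    $hrefs[] = $link->getAttribute('href');
}

echo implode(', ', $hrefs), "\n";
```

Because the parser builds a real tree, the same loop keeps working if attributes are reordered or extra whitespace appears in the markup.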

 

So What Exactly Does SmartDOMDocument Do Then?

DOMDocument by itself is good but has a few annoyances, which SmartDOMDocument tries to correct. Here are some things it does:

 

saveHTMLExact()

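DOMDocument has an extremely badly designed "feature" where if the HTML code you are loading does not contain <html> and <body> tags, it adds them automatically (yup, at the time of writing there were no flags to turn this behavior off; PHP 5.4+ built against libxml 2.7.8 or later has since added the LIBXML_HTML_NOIMPLIED and LIBXML_HTML_NODEFDTD options for loadHTML()).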

Thus, when you call $doc->saveHTML(), your newly saved content now has <html><body> and DOCTYPE in it. Not very handy when trying to work with code fragments (XML has a similar problem).

SmartDOMDocument contains a new function called saveHTMLExact() which does exactly what you would want – it saves HTML without adding that extra garbage that DOMDocument does.
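
If you are curious what "saving exactly" can look like with plain DOMDocument, here is a rough sketch. It is not SmartDOMDocument's actual implementation; saveBodyFragment() is a hypothetical helper name, and it assumes PHP 5.3.6+ (where saveHTML() accepts a node argument). It serializes only the children of the auto-added <body>.

```php
<?php
// Hypothetical helper, for illustration only: return the markup that sits
// inside the <body> element DOMDocument added, without the wrapper tags.
function saveBodyFragment(DOMDocument $doc)
{
    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    $html = '';
    foreach ($body->childNodes as $child) {
        // saveHTML($node) serializes just that node and its descendants.
        $html .= $doc->saveHTML($child);
    }
    return $html;
}

$doc = new DOMDocument();
$doc->loadHTML('<p>Hello <b>world</b></p>');

$full = $doc->saveHTML();            // includes DOCTYPE, <html> and <body>
$fragment = saveBodyFragment($doc);  // only the original fragment

echo $fragment, "\n";
```

Walking the tree like this avoids pattern-matching on the serialized output, at the cost of assuming the fragment ended up inside <body>.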

 

Encoding Fix

DOMDocument notoriously mishandles encoding: unless the markup declares its charset, loadHTML() treats the input as ISO-8859-1, so UTF-8 text comes out garbled.

SmartDOMDocument tries to work around this problem by enhancing loadHTML() to deal with encoding correctly. This behavior is transparent to you – just use loadHTML() as you would normally.
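
One way such a workaround can be done, sketched here with plain DOMDocument, is to convert every non-ASCII character to a numeric entity before parsing. This is an assumption about the general technique, not a copy of SmartDOMDocument's code; loadUtf8Html() is a hypothetical name, and the sketch needs the mbstring extension and PHP 7+.

```php
<?php
// Hypothetical helper, for illustration only: make the input pure ASCII by
// turning each non-ASCII character into a numeric entity, then parse it.
function loadUtf8Html(DOMDocument $doc, string $html): bool
{
    // Conversion map: [first code point, last code point, offset, mask].
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    $ascii = mb_encode_numericentity($html, $map, 'UTF-8');

    return $doc->loadHTML($ascii);
}

$doc = new DOMDocument();
loadUtf8Html($doc, '<p>русский</p>');

// DOM strings are always UTF-8, so the text comes back intact.
$text = $doc->getElementsByTagName('p')->item(0)->textContent;
echo $text, "\n";
```

A side effect of this approach is that saveHTML() may write those characters back out as entities rather than raw UTF-8; the DOM content itself is still correct.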

 

SmartDOMDocument Object As String

You can use a SmartDOMDocument object as a string which will print out its contents.

For example:

echo "Here is the HTML: $smart_dom_doc";
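
String conversion like this is what PHP's __toString() magic method provides. The sketch below shows the idea on a made-up subclass (PrintableDocument is a hypothetical name, and SmartDOMDocument's own method may return something different, such as the output of saveHTMLExact()).

```php
<?php
// Hypothetical subclass, for illustration only.
class PrintableDocument extends DOMDocument
{
    // Called automatically whenever the object is used as a string.
    public function __toString(): string
    {
        return (string) $this->saveHTML();
    }
}

$doc = new PrintableDocument();
$doc->loadHTML('<p>Hi</p>');

$message = "Here is the HTML: $doc";
echo $message;
```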

 

Example

This example loads sample HTML using SmartDOMDocument, uses getElementsByTagName() to find the first <img> tag and removeChild() to remove it, then prints the HTML before and after the removal, followed by the HTML of the removed image.

    // Sample fragment: no <html> or <body> wrapper, plus some UTF-8 text.
    $content = <<<CONTENT
<div class='class1'>
  <img src='http://www.google.com/favicon.ico' />
  Some Text
  <p>русский</p>
</div>
CONTENT;
 
    print "Before removing the image, the content is: " . htmlspecialchars($content) . "<br>";
 
    $content_doc = new SmartDOMDocument();
    $content_doc->loadHTML($content);
 
    // Default to an empty string so the final print works even if no <img> is found.
    $image = '';
 
    try {
      $first_image = $content_doc->getElementsByTagName("img")->item(0);
 
      if ($first_image) {
        // Detach the image from the document, then save the remaining fragment.
        $first_image->parentNode->removeChild($first_image);
 
        $content = $content_doc->saveHTMLExact();
 
        // Import the detached node into a fresh document to serialize it on its own.
        $image_doc = new SmartDOMDocument();
        $image_doc->appendChild($image_doc->importNode($first_image, true));
        $image = $image_doc->saveHTMLExact();
      }
    } catch(Exception $e) {
      // DOM errors are deliberately ignored in this example.
    }
 
    print "After removing the image, the content is: " . htmlspecialchars($content) . "<br>";
    print "The image is: " . htmlspecialchars($image);

 

Requirements And Prerequisites

  • Any version of PHP 5 that has DOMDocument should work. PHP 5.2+ was required before version 0.4, but that is no longer the case.
  • DOMDocument. This should be a built-in class, but I've seen instances of it missing for some reason. My guess is that 99.9% of you already have it.

 

Sounds Great – Where Do I Get It?

Download

http://svn.beerpla.net/repos/public/PHP/SmartDOMDocument/trunk/SmartDOMDocument.class.php

 

Check out from SVN

svn co http://svn.beerpla.net/repos/public/PHP/SmartDOMDocument/trunk SmartDOMDocument

I highly recommend using SVN (Subversion) because you can easily update to the latest version by running svn up.

 

Use as "svn:externals"

If you have an existing project in SVN and you would like to use SmartDOMDocument, you can set up this library as svn:externals.

svn:externals is kind of like a symlink to another repository from your existing SVN project. That way, you can still benefit from using SVN commands such as svn up without having to maintain a local copy of the external code.

You can read more about setting svn:externals in the Subversion documentation.

Here's how you would do this:

cd YOUR_PROJ_DIR;
svn propset svn:externals 'SmartDOMDocument http://svn.beerpla.net/repos/public/PHP/SmartDOMDocument/trunk' .
svn ci .
svn up

 

Version History

0.4

  • No longer using C14N() because it is causing bugs (such as <br> turning into <br></br>)
  • PHP 5.2+ is no longer a requirement, due to the change above.

0.3.2

  • test/example function added

0.3.1

  • suppress warnings when loading HTML by default (this may change to use a setting later). This gets rid of "empty content", "unexpected tag", and other warnings about malformed HTML.
  • add the standard trunk/tags/branches layout to the SVN repository.

0.3

  • use a better, more portable method of dealing with encodings properly (thanks piopier).

0.2

  • use the undocumented DOMDocument->C14N() if it's available (PHP 5.2+) to save exact HTML (really, PHP? We don't document extremely useful functions anymore?).

0.1

  • initial release.

 

How To Report Bugs

You have a few options here:

  • Leave a comment here.
  • Use the contact form on the About page.
  • Send an email to admin [{at}] this site's domain.
  • IM me using the Digsby widget on the right.

22 Responses to “SmartDOMDocument – A Smarter PHP DOMDocument Class”

  1. Hey man,
    Thanks first off for sharing this!
    I found a glitch when using the saveHTMLExact() method with <br /> tags in the SmartDOMDocument: the <br /> tags get converted to <br></br> instead.

    Also, if using SmartDOMDocument to convert code in between <pre> tags (such as code pastes from a form textarea) with htmlentities(), and the code between the <pre> tags has a doctype in it, the doctype gets stripped as well by saveHTMLExact().

    Thanks again.

    • Thanks for the report, David.

      I actually just experienced the same problem in my own project and saw what you're talking about.

      Just fixed it in 0.4 – it's no longer using the C14N() function that was causing this, which also removes the PHP 5.2 requirement.

      Give that one a go?

  2. Hey man,
    I was doing some more tinkering with DOMDocument and generating XHTML content and I figured a few things out.
    First off, there's an excellent tutorial on making complete XHTML documents with DOMDocument at http://www.ultramegatech.com/blog/2009/07/generating-xhtml-documents-using-domdocument-in-php/ and it was from that I discovered this:
    saveHTML doesn't output valid XHTML tags, but saveXML does. Never thought of that, but it fixes the issue with the <br /> tags being turned into <br></br> for one. In regards to that, check out the LIBXML_NOEMPTYTAG constant @ http://us3.php.net/manual/en/libxml.constants.php …, look familiar?

    I am curious how difficult it would be to implement a saveXMLExact() method in your SmartDOMDocument class, or whether it would need to be a new class. Unfortunately, saveXML() adds the <?xml ?> declaration to the top, and that's a major inconvenience when using it to generate only a certain part of a document. I was looking at using it for a form generator/validation class I had written to generate the whole form, but that declaration is a problem.
    Here's my example: http://pastebin.com/HhXMJnDm

    Let me know what you think and whatever you come up with.

    Thanks again.

  3. Олим says:

    You can't even imagine how grateful I am to you. Thank you so much, brother.

  4. Kees says:

    I wish to try it but your link seems to be broken….

  5. sir says:

    yes – the link is broken. can you please fix it?
    thanks

  6. Guys, while I'm fixing the SVN server, I put a copy of the code here: http://tinypaste.com/8911ac

    I'll try to fix it asap but I am currently working through recompiling svn trunk due to a bug I need fixed which is not backported to any released version.

  7. Chris Dary says:

    Artem, this is pretty smart, thanks for it.

    I've been looking into making a DOMDocument/DOMElement implementation ON STEROIDS and this gave me the bit of inspiration I needed. I'm now using SmartDOMDocument in conjunction with a new class, SmartDOMElement, (which is registered into SmartDOMDocument in __construct via $this->registerNodeClass('DOMElement', 'SmartDOMElement'); ) – both will have enhanced functionality over their respective base classes.

    So: Thanks!

  8. dylan says:

    What I get for the Example is:

    After removing the image, the content is: <div class="class1">
    Some Text
    <p>&#1088;&#1091;&#1089;&#1089;&#1082;&#1080;&#1081;</p>
    </div>
    The image is: <img src="http://www.google.com/favicon.ico">

    Is this correct? All Russian characters are encoded.

    PS. This comment editor is really bad for entering HTML code.

  9. Asad says:

    You might want to change the regex a bit. Current one doesn't really work for me.

    preg_replace(array('#^<\!DOCTYPE.*?.*?.*?#is', '#.*?$#si'), ", trim($this->saveHTML()));

  10. Asad says:

    Last one messed up. Let's try again.

    preg_replace(array('#^<\!DOCTYPE.*?.*?.*?#is', '#.*?$#si'), '', trim($this->saveHTML()));
    
  11. Moustafa Samir says:

    thanks so much it really helped me :)

  12. Vbg says:

    Thanx – I was stuck with improper UTF for ages …

  13. Jeff P says:

    Hey, does this happen to fix the issue where DOMDocument will fix a broken HTML page (i.e., unclosed tags and the like)?

  14. Rudloff says:

    I really like what you have done here.
    I personally wrote a NewDOMElement class that helps me create markup quickly (http://code.google.com/p/chaton-cms/source/browse/trunk/classes/DOMElement.php).
    I think I will use your class in my future projects.

  15. Same says:

    Thanks a lot man, I had bad UTF-8 encoding, now the display is very good and all I had to modify in my code after including your PHP class file was "new DOMDocument()" turned into "new SmartDOMDocument()".

  16. Nisse says:

    Super! This saved me countless hours of swearing at the UTF-8 encoding not working properly. This just worked out of the box! Question though: I'm getting the HTML entities instead of the UTF-8 characters directly. For example, åäö becomes &aring; &auml; &ouml;. How would I fix that?

  17. Gustav Nilsson says:

    Solved all the problems in the world!

    Cheers man, gr8 extension :D

  18. Daniel Carbone says:

    Hey just wanted to say HUGE kudos for extending DOMDocument like this! It made my job that much easier!

  19. Hermes says:

    You saved my ass man. Thank you! :) :) :)
