What Is SmartDOMDocument?

  • SmartDOMDocument is an enhanced version of PHP's built-in DOMDocument class.
  • SmartDOMDocument inherits from DOMDocument, so it's very easy to use – just declare an object of type SmartDOMDocument instead of DOMDocument and enjoy the new behavior on top of all existing functionality (see example below).

 

What Is DOMDocument?

  • DOMDocument is a native PHP library for using DOM to read, parse, manipulate, and write HTML and XML.
  • Instead of using hacky regexes that are prone to breaking as soon as something you haven't thought of changes, DOMDocument parses HTML/XML using the DOM (Document Object Model), just like your browser, and creates an easily manipulatable object in memory.
  • DOMDocument can actually validate and normalize your HTML/XML.
  • DOMDocument supports namespaces.
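
As a rough illustration of the points above, here is a minimal sketch that uses only the stock DOMDocument class to pull link targets out of a snippet without any regexes. The sample markup and variable names are made up for this example.

```php
<?php
// Sample markup invented for illustration only.
$html = '<p>Visit <a href="http://example.com/a">A</a> and <a href="http://example.com/b">B</a>.</p>';

$doc = new DOMDocument();
// The @ hides the warnings libxml raises for HTML that is not well formed.
@$doc->loadHTML($html);

// Walk the parsed tree instead of pattern-matching the raw string.
$links = array();
foreach ($doc->getElementsByTagName('a') as $anchor) {
    $links[] = $anchor->getAttribute('href');
}

print implode("\n", $links) . "\n";
```

Because the lookup works on the parsed tree, it should keep working if attribute order, quoting style, or whitespace in the markup changes.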

 

So What Exactly Does SmartDOMDocument Do Then?

DOMDocument by itself is good but has a few annoyances, which SmartDOMDocument tries to correct. Here are some things it does:

 

saveHTMLExact()

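DOMDocument has an extremely badly designed "feature" where if the HTML code you are loading does not contain <html> and <body> tags, it adds them automatically (yup, when this was written there were no flags to turn this behavior off; PHP 5.4+ built against libxml 2.7.8+ later added the LIBXML_HTML_NOIMPLIED and LIBXML_HTML_NODEFDTD options for loadHTML()).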

Thus, when you call $doc->saveHTML(), your newly saved content now has <html><body> and DOCTYPE in it. Not very handy when trying to work with code fragments (XML has a similar problem).

SmartDOMDocument contains a new function called saveHTMLExact() which does exactly what you would want – it saves HTML without adding that extra garbage that DOMDocument does.
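
To make the problem concrete, the sketch below uses plain DOMDocument to show the wrapper being added, then works around it by serializing only the children of <body>. The helper name save_body_children() is hypothetical, and this is one possible approach rather than how saveHTMLExact() is actually implemented. It assumes PHP 5.3.6 or later, where saveHTML() accepts a node argument.

```php
<?php
$fragment = "<div class='class1'><p>Some Text</p></div>";

$doc = new DOMDocument();
@$doc->loadHTML($fragment);

// Stock behavior: the whole document is serialized, wrapper tags included.
$full = $doc->saveHTML();

// Hypothetical helper: serialize only what sits inside <body>.
function save_body_children(DOMDocument $doc) {
    $out = '';
    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body) {
        foreach ($body->childNodes as $child) {
            // saveHTML($node) needs PHP 5.3.6+.
            $out .= $doc->saveHTML($child);
        }
    }
    return $out;
}

$exact = save_body_children($doc);

print "Full:  " . $full . "\n";
print "Exact: " . $exact . "\n";
```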

 

Encoding Fix

DOMDocument notoriously doesn't handle encoding (at least UTF-8) correctly and garbles the output.

SmartDOMDocument tries to work around this problem by enhancing loadHTML() to deal with encoding correctly. This behavior is transparent to you – just use loadHTML() as you would normally.
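
For readers curious what such a workaround can look like, here is a sketch of one commonly used trick with stock DOMDocument: prefixing the markup with an XML encoding declaration so the parser treats the input as UTF-8. This is an assumption about a possible technique, not necessarily the method SmartDOMDocument uses internally. The file itself must be saved as UTF-8.

```php
<?php
$html = '<p>русский</p>';

$doc = new DOMDocument();
// Assumption: the encoding hint makes libxml read the bytes as UTF-8
// instead of falling back to its default guess.
@$doc->loadHTML('<?xml encoding="UTF-8">' . $html);

// Read the text back from the parsed tree.
$text = $doc->getElementsByTagName('p')->item(0)->textContent;

print $text . "\n";
```

The hint ends up as an extra node in the document, so code that serializes the result may want to remove it again before saving.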

 

SmartDOMDocument Object As String

You can use a SmartDOMDocument object as a string which will print out its contents.

For example:

echo "Here is the HTML: $smart_dom_doc";
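
String conversion like this is typically done with PHP's __toString() magic method. The sketch below shows the idea on a hypothetical subclass named StringableDOMDocument. It is not the SmartDOMDocument source, and it returns the full saveHTML() output for simplicity.

```php
<?php
// Hypothetical class for illustration; not part of SmartDOMDocument.
class StringableDOMDocument extends DOMDocument {
    public function __toString() {
        // The cast guards against saveHTML() returning false on failure.
        return (string) $this->saveHTML();
    }
}

$doc = new StringableDOMDocument();
@$doc->loadHTML('<p>Hello</p>');

// Interpolating the object calls __toString() implicitly.
$message = "Here is the HTML: $doc";

print $message;
```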

 

Example

This example loads sample HTML using SmartDOMDocument, uses getElementsByTagName() to find and removeChild() to remove the first <img> tag, then prints the old HTML and the newly removed image HTML.

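    // Load the class (file name as in the download link).
    require_once 'SmartDOMDocument.class.php';

    $content = <<<CONTENT
<div class='class1'>
  <img src='http://www.google.com/favicon.ico' />
  Some Text
  <p>русский</p>
</div>
CONTENT;
 
    print "Before removing the image, the content is: " . htmlspecialchars($content) . "<br>";
 
    $content_doc = new SmartDOMDocument();
    $content_doc->loadHTML($content);
 
    // Stays empty if the content has no <img> tag.
    $image = '';
 
    try {
      // item(0) returns null when there is no matching element.
      $first_image = $content_doc->getElementsByTagName("img")->item(0);
 
      if ($first_image) {
        // Detach the image from the tree, then save the remaining fragment.
        $first_image->parentNode->removeChild($first_image);
 
        $content = $content_doc->saveHTMLExact();
 
        // Copy the removed node into its own document to serialize it alone.
        $image_doc = new SmartDOMDocument();
        $image_doc->appendChild($image_doc->importNode($first_image, true));
        $image = $image_doc->saveHTMLExact();
      }
    } catch(Exception $e) {
      // Ignore DOM errors; $content and $image keep their previous values.
    }
 
    print "After removing the image, the content is: " . htmlspecialchars($content) . "<br>";
    print "The image is: " . htmlspecialchars($image);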

 

Requirements And Prerequisites

  • PHP 5 with DOMDocument available. Earlier versions required PHP 5.2+, but as of version 0.4 any version of PHP 5 that has DOMDocument should work.
  • DOMDocument – this should be a built-in class but I've seen instances of it missing for some reason. My guess is 99.9% you will already have it.

 

Sounds Great – Where Do I Get It?

Download

http://svn.beerpla.net/repos/public/PHP/SmartDOMDocument/trunk/SmartDOMDocument.class.php

 

Check out from SVN

svn co http://svn.beerpla.net/repos/public/PHP/SmartDOMDocument/trunk SmartDOMDocument

I highly recommend using SVN (Subversion) because you can easily update to the latest version by running svn up.

 

Use as "svn:externals"

If you have an existing project in SVN and you would like to use SmartDOMDocument, you can set up this library as svn:externals.

svn:externals is kind of like a symlink to another repository from your existing SVN project. That way, you can still benefit from using SVN commands such as svn up without having to maintain a local copy of the external code.

You can read more about setting svn:externals here.

Here's how you would do this:

cd YOUR_PROJ_DIR;
svn propset svn:externals 'SmartDOMDocument http://svn.beerpla.net/repos/public/PHP/SmartDOMDocument/trunk' .
svn ci .
svn up

 

Version History

0.4

  • No longer using C14N() because it is causing bugs (such as <br> turning into <br></br>)
  • PHP 5.2+ is no longer a requirement, due to the change above.

0.3.2

  • test/example function added

0.3.1

  • suppress warnings when loading HTML by default (this may change to use a setting later). This gets rid of "empty content", "unexpected tag", and other not well formed HTML warnings.
  • add the standard trunk/tags/branches layout to the SVN repository.

0.3

  • use a better, more portable method of dealing with encodings properly (thanks piopier).

0.2

  • use the undocumented DOMDocument->C14N() if it's available (PHP 5.2+) to save exact HTML (really, PHP? We don't document extremely useful functions anymore?).

0.1

  • initial release.

 

How To Report Bugs

You have a few options here:

  • Leave a comment here.
  • Use the contact form on the About page.
  • Send an email to admin [{at}] this site's domain.
  • IM me using the Digsby widget on the right.

33 Responses to “SmartDOMDocument – A Smarter PHP DOMDocument Class”

  1. David Miles says:

    Hey man,
    Thanks first off for sharing this!
    I found a glitch when you try using the saveHTMLExact() method with <br /> tags in the SmartDOMDocument: the <br /> tags get converted to <br></br> instead.

    Also, if using SmartDOMDocument to convert code in between <pre> tags (such as code pastes from a form textarea) with htmlentities() and the code between the <pre> tags has a doctype in it, it gets stripped as well with saveHTMLExact().

    Thanks again.

    • Thanks for the report, David.

      I actually just experienced the same problem in my own project and saw what you're talking about.

      Just fixed it in 0.4 – it's no longer using the C14N() function that was causing this, which also removes the PHP 5.2 requirement.

      Give that one a go?

  2. David Miles says:

    Hey man,
    I was doing some more tinkering with DOMDocument and generating XHTML content and I figured a few things out.
    First off, there's an excellent tutorial on making complete XHTML documents with DOMDocument at http://www.ultramegatech.com/blog/2009/07/generating-xhtml-documents-using-domdocument-in-php/ and it was from that I discovered this:
    saveHTML doesn't output valid XHTML tags, but saveXML does. Never thought of that, but it fixes the issue with the tags <br /> being turned into <br></br> for one. In regards to that, check out LIBXML_NOEMPTYTAG constant @ http://us3.php.net/manual/en/libxml.constants.php …, look familiar?

    I am curious how difficult it would be to either implement a saveXMLExact() method into your SmartDOMDocument class or if it would need to be a new class. Unfortunately saveXML() adds the <?xml ?> declaration tag to the top and that's a major inconvenience if using it to make only a certain part of a document. I was looking at using it for a form generator/validation class I had written to generate the whole form, but that tag is a problem.
    Here's my example: http://pastebin.com/HhXMJnDm

    Let me know what you think and whatever you come up with.

    Thanks again.

  3. Олим says:

    You can't even imagine how grateful I am to you. Thank you so much, brother.

  4. Kees says:

    I wish to try it but your link seems to be broken….

  5. sir says:

    yes – the link is broken. can you please fix it?
    thanks

  6. Guys, while I'm fixing the SVN server, I put a copy of the code here: http://tinypaste.com/8911ac

    I'll try to fix it asap but I am currently working through recompiling svn trunk due to a bug I need fixed which is not backported to any released version.

  7. Chris Dary says:

    Artem, this is pretty smart, thanks for it.

    I've been looking into making a DOMDocument/DOMElement implementation ON STEROIDS and this gave me the bit of inspiration I needed. I'm now using SmartDOMDocument in conjunction with a new class, SmartDOMElement, (which is registered into SmartDOMDocument in __construct via $this->registerNodeClass('DOMElement', 'SmartDOMElement'); ) – both will have enhanced functionality over their respective base classes.

    So: Thanks!

  8. dylan says:

    What I get for the Example is:

    After removing the image, the content is: <div class="class1">
    Some Text
    <p>& #1088;& #1091;& #1089;& #1089;& #1082;& #1080;& #1081;</p>
    </div>
    The image is: <img src="http://www.google.com/favicon.ico">

    Is this correct? All Russian characters are encoded.

    PS. This comment editor is really bad for entering HTML code.

  9. Asad says:

    You might want to change the regex a bit. Current one doesn't really work for me.

    preg_replace(array('#^<\!DOCTYPE.*?.*?.*?#is', '#.*?$#si'), ", trim($this->saveHTML()));

  10. Asad says:

    Last one messed up. Let's try again.

    preg_replace(array('#^<\!DOCTYPE.*?.*?.*?#is', '#.*?$#si'), '', trim($this->saveHTML()));
    
  11. Moustafa Samir says:

    thanks so much it really helped me :)

  12. Vbg says:

    Thanx – I was stuck with improper UTF for ages …

  13. Jeff P says:

    Hey, does this happen to fix the issue where DOMDocument will fix a broken HTML page (i.e. unclosed tags and the like)?

  14. Rudloff says:

    I really like what you have done here.
    I personally wrote a NewDOMElement class that helps me create markup quickly (http://code.google.com/p/chaton-cms/source/browse/trunk/classes/DOMElement.php).
    I think I will use your class in my future projects.

  15. Same says:

    Thanks a lot man, I had bad UTF-8 encoding, now the display is very good and all I had to modify in my code after including your PHP class file was "new DOMDocument()" turned into "new SmartDOMDocument()".

  16. Nisse says:

    Super! This saved me countless hours of swearing at the UTF-8 encoding not working properly. This just worked out of the box! Question though, I'm getting the HTML entities instead of the UTF-8 characters directly. For example, åäö becomes &aring; &auml; &ouml;. How would I fix that?

  17. Gustav Nilsson says:

    Solved all the problems in the world!

    Cheers man, gr8 extension :D

  18. Daniel Carbone says:

    Hey just wanted to say HUGE kudos for extending DOMDocument like this! It made my job that much easier!

  19. Hermes says:

    You saved my ass man. Thank you! :) :) :)

  20. Johnson says:

    I am using IntelliJ, does anyone know of a way to make it so you can see the child nodes of the SmartDomDocument Object? Right now I am getting an empty array though it is working correctly for me.

    Thanks for this, it has been such a life saver!

  21. Jeff W says:

    I'm trying to parse some information for a remote webpage using your class and it seems to just strip the special chars (and in one case doesn't work right):

    My source code:
    http://beta.thebrews.us/test.phps

    Example Urls that do not work: (use my script ?b=n)

    http://www.ratebeer.com/beer/84264/ (you can see the "e" getting stripped in the output)

    http://www.ratebeer.com/beer/150344/ (you can see the output doesn't work right)

    Any input would be greatly appreciated. :)

  26. Steven Vachon says:

    Using this on my site and may use it for one of my open source projects. Thanks! By the way, you got some spam messages on here that need to be deleted ;)

  28. michael says:

    the svn is not found please upload again

  30. Paul says:

    Thanks!! You saved me hours with this.
    I was having trouble with both the unwanted body tags and the character encoding being mishandled by the DOMDocument class in a Drupal module. Your class did a great job of cleaning it up quickly and easily. Nice work.
