What Is SmartDOMDocument?

  • SmartDOMDocument is an enhanced version of PHP's built-in DOMDocument class.
  • SmartDOMDocument inherits from DOMDocument, so it's very easy to use – just declare an object of type SmartDOMDocument instead of DOMDocument and enjoy the new behavior on top of all existing functionality (see example below).

 

What Is DOMDocument?

  • DOMDocument is a native PHP library for using DOM to read, parse, manipulate, and write HTML and XML.
  • Instead of using hacky regexes that are prone to breaking as soon as something you haven't thought of changes, DOMDocument parses HTML/XML using the DOM (Document Object Model), just like your browser, and creates an easily manipulatable object in memory.
  • DOMDocument can actually validate and normalize your HTML/XML.
  • DOMDocument supports namespaces.
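
To make the contrast with regexes concrete, here is a minimal sketch using only the stock DOMDocument class. The sample markup and variable names are made up for illustration.

```php
<?php
// Minimal sketch: parse an HTML fragment with the built-in DOMDocument
// and read values out of the resulting tree instead of using regexes.
$html = '<div class="post"><a href="/one">First</a> <a href="/two">Second</a></div>';

$doc = new DOMDocument();
$doc->loadHTML($html);

$links = [];
foreach ($doc->getElementsByTagName('a') as $a) {
    // Map each link's href attribute to its visible text.
    $links[$a->getAttribute('href')] = $a->textContent;
}

print_r($links);
```

The same loop keeps working if attributes are reordered or whitespace changes, which is where a regex would typically break.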

 

So What Exactly Does SmartDOMDocument Do Then?

DOMDocument by itself is good but has a few annoyances, which SmartDOMDocument tries to correct. Here are some things it does:

 

saveHTMLExact()

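DOMDocument has an extremely badly designed "feature" where if the HTML code you are loading does not contain <html> and <body> tags, it adds them automatically (yup, when this was written there were no flags to turn this behavior off; PHP 5.4+ built against libxml 2.7.8+ offers LIBXML_HTML_NOIMPLIED and LIBXML_HTML_NODEFDTD for loadHTML()).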

Thus, when you call $doc->saveHTML(), your newly saved content now has <html><body> and DOCTYPE in it. Not very handy when trying to work with code fragments (XML has a similar problem).

SmartDOMDocument contains a new function called saveHTMLExact() which does exactly what you would want – it saves HTML without adding that extra garbage that DOMDocument does.
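
The wrapping behavior is easy to reproduce with the stock class. The sketch below shows it, along with one possible way to get the fragment back. fragmentHtml() is a hypothetical helper written for this illustration and is not how saveHTMLExact() is implemented; it assumes PHP 5.3.6+, where saveHTML() accepts a node argument.

```php
<?php
// Illustration only: fragmentHtml() is a hypothetical helper, not part of
// SmartDOMDocument. It relies on saveHTML($node), available since PHP 5.3.6.
function fragmentHtml($html)
{
    $doc = new DOMDocument();
    @$doc->loadHTML($html); // suppress warnings about imperfect markup

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    // Serialize only the children of the automatically inserted <body>.
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$fragment = '<div class="box"><p>Hello</p></div>';

$doc = new DOMDocument();
$doc->loadHTML($fragment);
$full = $doc->saveHTML();          // wrapped in <html> and <body>

$exact = fragmentHtml($fragment);  // only the original fragment
echo $exact, "\n";
```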

 

Encoding Fix

DOMDocument notoriously doesn't handle encoding (at least UTF-8) correctly and garbles the output.

SmartDOMDocument tries to work around this problem by enhancing loadHTML() to deal with encoding correctly. This behavior is transparent to you – just use loadHTML() as you would normally.
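
As a rough idea of what such a workaround can look like, the sketch below converts every non-ASCII character to a numeric entity before parsing, so the parser never has to guess the input encoding. loadUtf8Html() is a hypothetical helper, not the library's code. A reader comment on this post shows loadHTML() overridden with mb_convert_encoding($html, 'HTML-ENTITIES', $encoding); that encoding name is deprecated as of PHP 8.2, so this sketch uses mb_encode_numericentity() instead. Both need the mbstring extension.

```php
<?php
// Sketch of one possible workaround (an assumption, not the library's exact code).
function loadUtf8Html(DOMDocument $doc, $html)
{
    // Conversion map: code points 0x80..0x10FFFF, offset 0, mask 0x1FFFFF.
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    $ascii = mb_encode_numericentity($html, $map, 'UTF-8');

    // Return the same boolean result as loadHTML(), with warnings suppressed.
    return @$doc->loadHTML($ascii);
}

$doc = new DOMDocument();
loadUtf8Html($doc, '<p>русский</p>');

// DOM strings are UTF-8 internally, so the entities decode back to text.
$text = $doc->getElementsByTagName('p')->item(0)->textContent;
echo $text, "\n";
```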

 

SmartDOMDocument Object As String

You can use a SmartDOMDocument object as a string, which prints out its contents.

For example:

echo "Here is the HTML: $smart_dom_doc";
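
String interpolation like this works for any class that defines PHP's __toString() magic method. The sketch below uses a hypothetical PrintableDocument class to show the mechanism. It returns the full saveHTML() output, which is an assumption made for simplicity and may differ from what SmartDOMDocument returns.

```php
<?php
// Illustration only: PrintableDocument is a hypothetical class showing how a
// DOMDocument subclass can support string interpolation via __toString().
class PrintableDocument extends DOMDocument
{
    public function __toString()
    {
        $html = $this->saveHTML();
        // saveHTML() returns false on failure; __toString() must return a string.
        return $html === false ? '' : $html;
    }
}

$doc = new PrintableDocument();
$doc->loadHTML('<p>Hi</p>');

$message = "Here is the HTML: $doc";
echo $message;
```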

 

Example

This example loads sample HTML using SmartDOMDocument, uses getElementsByTagName() to find and removeChild() to remove the first <img> tag, then prints the HTML before and after the removal, along with the removed image's HTML.

    // Sample fragment: one image, some text, and a non-ASCII paragraph.
    $content = <<<CONTENT
<div class='class1'>
  <img src='http://www.google.com/favicon.ico' />
  Some Text
  <p>русский</p>
</div>
CONTENT;

    print "Before removing the image, the content is: " . htmlspecialchars($content) . "<br>";

    $content_doc = new SmartDOMDocument();
    $content_doc->loadHTML($content);

    $image = ''; // stays empty if the fragment has no <img>

    try {
      $first_image = $content_doc->getElementsByTagName("img")->item(0);

      if ($first_image) {
        // Detach the image, then save what is left of the fragment.
        $first_image->parentNode->removeChild($first_image);

        $content = $content_doc->saveHTMLExact();

        // Import the detached node into its own document to serialize it.
        $image_doc = new SmartDOMDocument();
        $image_doc->appendChild($image_doc->importNode($first_image, true));
        $image = $image_doc->saveHTMLExact();
      }
    } catch(Exception $e) { }

    print "After removing the image, the content is: " . htmlspecialchars($content) . "<br>";
    print "The image is: " . htmlspecialchars($image);

 

Requirements And Prerequisites

  • PHP 5 with DOMDocument. PHP 5.2+ was required before version 0.4, but any version of PHP 5 that has DOMDocument should work now.
  • DOMDocument – this should be a built-in class but I've seen instances of it missing for some reason. My guess is that 99.9% of you will already have it.
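
If you are unsure whether your installation has it, a quick check along these lines should tell you. This is a minimal sketch; the error message is made up, and on some Linux distributions the DOM classes come from a separately installed PHP XML package.

```php
<?php
// Quick availability check before relying on DOMDocument.
$hasDom = extension_loaded('dom') && class_exists('DOMDocument');

if (!$hasDom) {
    // Report the problem and stop with a non-zero exit code.
    fwrite(STDERR, "The PHP DOM extension is not available.\n");
    exit(1);
}

echo "DOMDocument is available.\n";
```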

 

Sounds Great – Where Do I Get It?

Download

http://svn.beerpla.net/repos/public/PHP/SmartDOMDocument/trunk/SmartDOMDocument.class.php

 

Check out from SVN

svn co http://svn.beerpla.net/repos/public/PHP/SmartDOMDocument/trunk SmartDOMDocument

I highly recommend using SVN (Subversion) because you can easily update to the latest version by running svn up.

 

Use as "svn:externals"

If you have an existing project in SVN and you would like to use SmartDOMDocument, you can set up this library as svn:externals.

svn:externals is kind of like a symlink to another repository from your existing SVN project. That way, you can still benefit from using SVN commands such as svn up without having to maintain a local copy of the external code.

You can read more about setting svn:externals here.

Here's how you would do this:

cd YOUR_PROJ_DIR;
svn propset svn:externals 'SmartDOMDocument http://svn.beerpla.net/repos/public/PHP/SmartDOMDocument/trunk' .
svn ci .
svn up

 

Version History

0.4.1

  • added return value to loadHTML() (thanks, Grey)

0.4

  • No longer using C14N() because it is causing bugs (such as <br> turning into <br></br>)
  • PHP 5.2+ is no longer a requirement, due to the change above.

0.3.2

  • test/example function added

0.3.1

  • suppress warnings when loading HTML by default (this may change to use a setting later). This gets rid of "empty content", "unexpected tag", and other not well formed HTML warnings.
  • add the standard trunk/tags/branches layout to the SVN repository.

0.3

  • use a better, more portable method of dealing with encodings properly (thanks piopier).

0.2

  • use the undocumented DOMDocument->C14N() if it's available (PHP 5.2+) to save exact HTML (really, PHP? We don't document extremely useful functions anymore?).

0.1

  • initial release.

 


How To Report Bugs

You have a few options here:

  • Leave a comment here.
  • Use the contact form on the About page.
  • Send an email to admin [{at}] this site's domain.
  • IM me using the Digsby widget on the right.
  • http://amereservant.com David Miles

    Hey man,
    Thanks first off for sharing this!
    I found a glitch when you try using saveHTMLExact() method with tags in the SmartDOMDocument, the <br /> tags get converted to <br></br> instead.

    Also, if using SmartDOMDocument to convert code in between <pre> tags (such as code pastes from a form textarea) with htmlentities() and the code between the <pre> tags has a doctype in it, it gets stripped as well with saveHTMLExact().

    Thanks again.

    • http://beerpla.net Artem Russakovskii

      Thanks for the report, David.

      I actually just experienced the same problem in my own project and saw what you're talking about.

      Just fixed it in 0.4 – it's no longer using the C14N() function that was causing this, which also removes the PHP 5.2 requirement.

      Give that one a go?

  • http://amereservant.com David Miles

    Hey man,
    I was doing some more tinkering with DOMDocument and generating XHTML content and I figured a few things out.
    First off, there's an excellent tutorial on making complete XHTML documents with DOMDocument at http://www.ultramegatech.com/blog/2009/07/generating-xhtml-documents-using-domdocument-in-php/ and it was from that I discovered this:
    saveHTML doesn't output valid XHTML tags, but saveXML does. Never thought of that, but it fixes the issue with the tags <br /> being turned into <br></br> for one. In regards to that, check out LIBXML_NOEMPTYTAG constant @ http://us3.php.net/manual/en/libxml.constants.php …, look familiar?

    I am curious how difficult it would be to either implement a saveXMLExact() method into your SmartDOMDocument class or if it would need to be a new class. Unfortunately saveXML() adds the <?xml version="1.0"?> tag to the top and that's a major inconvenience if using it to make only a certain part of a document. I was looking at using it for a form generator/validation class I had written to generate the whole form, but that tag is a problem.
    Here's my example: http://pastebin.com/HhXMJnDm

    Let me know what you think and whatever you come up with.

    Thanks again.

  • Олим

    You can't even imagine how grateful I am to you. Thank you so much, brother.

  • Kees

    I wish to try it but your link seems to be broken….

  • sir

    yes – the link is broken. can you please fix it?
    thanks

    • chrisme

      http and svn download not working… very interested to try your code ;D

    • Keith Carter

      Same problem here.

  • http://beerpla.net Artem Russakovskii

    Guys, while I'm fixing the SVN server, I put a copy of the code here: http://tinypaste.com/8911ac

    I'll try to fix it asap but I am currently working through recompiling svn trunk due to a bug I need fixed which is not backported to any released version.

  • Chris Dary

    Artem, this is pretty smart, thanks for it.

    I've been looking into making a DOMDocument/DOMElement implementation ON STEROIDS and this gave me the bit of inspiration I needed. I'm now using SmartDOMDocument in conjunction with a new class, SmartDOMElement, (which is registered into SmartDOMDocument in __construct via $this->registerNodeClass('DOMElement', 'SmartDOMElement'); ) – both will have enhanced functionality over their respective base classes.

    So: Thanks!

  • dylan

    What I get for the Example is:

    After removing the image, the content is: <div class="class1">
    Some Text
    <p>& #1088;& #1091;& #1089;& #1089;& #1082;& #1080;& #1081;</p>
    </div>
    The image is: <img src="http://www.google.com/favicon.ico">

    Is this correct? All Russian characters are encoded.

    PS. This comment editor is really bad for entering HTML code.

  • Asad

    You might want to change the regex a bit. Current one doesn't really work for me.

    preg_replace(array('#^<\!DOCTYPE.*?.*?.*?#is', '#.*?$#si'), ", trim($this->saveHTML()));

  • Asad

    Last one messed up. Let's try again.

    preg_replace(array('#^<\!DOCTYPE.*?.*?.*?#is', '#.*?$#si'), '', trim($this->saveHTML()));
    
  • Moustafa Samir

    thanks so much it really helped me :)

  • Vbg

    Thanx – I was stuck with improper UTF for ages …

  • Jeff P

    Hey, does this happen to fix the issue where DOMDocument will fix a broken html page (IE unclosed tags, the like)?

  • Rudloff

    I really like what you have done here.
    I personally wrote a NewDOMElement class that helps me create markup quickly (http://code.google.com/p/chaton-cms/source/browse/trunk/classes/DOMElement.php).
    I think I will use your class in my future projects.

  • Same

    Thanks a lot man, I had bad UTF-8 encoding, now the display is very good and all I had to modify in my code after including your PHP class file was "new DOMDocument()" turned into "new SmartDOMDocument()".

  • Nisse

    Super! This saved me countless hours of swearing at the UTF-8 encoding not working properly. This just worked out of the box! Question though: I'm getting the HTML entities instead of the UTF-8 characters directly. For example, åäö becomes &aring; &auml; &ouml;. How would I fix that?

  • Gustav Nilsson

    Solved all the problems in the world!

    Cheers man, gr8 extension :D

  • Daniel Carbone

    Hey just wanted to say HUGE kudos for extending DOMDocument like this! It made my job that much easier!

  • Hermes

    You saved my ass man. Thank you! :) :) :)

  • Johnson

    I am using IntelliJ, does anyone know of a way to make it so you can see the child nodes of the SmartDomDocument Object? Right now I am getting an empty array though it is working correctly for me.

    Thanks for this, it has been such a life saver!

  • Jeff W

    I'm trying to parse some information for a remote webpage using your class and it seems to just strip the special chars (and in one case doesn't work right):

    My source code:
    http://beta.thebrews.us/test.phps

    Example Urls that do not work: (use my script ?b=n)

    http://www.ratebeer.com/beer/84264/ (you can see the "e" getting stripped in the output)

    http://www.ratebeer.com/beer/150344/ (you can see the output doesn't work right)

    Any input would be greatly appreciated. :)

  • Steven Vachon

    Using this on my site and may use it for one of my open source projects. Thanks! By the way, you got some spam messages on here that need to be deleted ;)

    • http://beerpla.net Artem Russakovskii

      Just took care of them, and I'll switch to Disqus shortly.

  • michael

    the svn is not found please upload again

  • Dodzi Dzakuma

    This looks amazing. I'm going to try it out. You wouldn't happen to have a github or bitbucket repository of this, would you?

  • Grey

    Hmm, also your loadHTML breaks the default functionality of loadHTML which has a boolean return, whereas yours does not return. Just change to

    public function loadHTML($html, $encoding = "UTF-8")
    {
        $html = mb_convert_encoding($html, 'HTML-ENTITIES', $encoding);
        // suppress warnings
        return @parent::loadHTML($html);
    }

    • http://www.androidpolice.com/ Artem Russakovskii

      Thanks, just committed 0.4.1 with this change.