- What Is SmartDOMDocument?
- What Is DOMDocument?
- So What Exactly Does SmartDOMDocument Do Then?
- saveHTMLExact()
- Encoding Fix
- SmartDOMDocument Object As String
- Example
- Requirements And Prerequisites
- Sounds Great – Where Do I Get It?
- Download
- Check out from SVN
- Use as "svn:externals"
- Version History
- References
- How To Report Bugs
- Comments (19)
What Is SmartDOMDocument?
- SmartDOMDocument is an enhanced version of PHP's built-in DOMDocument class.
- SmartDOMDocument inherits from DOMDocument, so it's very easy to use – just declare an object of type SmartDOMDocument instead of DOMDocument and enjoy the new behavior on top of all existing functionality (see example below).
What Is DOMDocument?
- DOMDocument is a native PHP library for using DOM to read, parse, manipulate, and write HTML and XML.
- Instead of using hacky regexes that are prone to breaking as soon as something you haven't thought of changes, DOMDocument parses HTML/XML using the DOM (Document Object Model), just like your browser, and creates an easily manipulatable object in memory.
- DOMDocument can actually validate and normalize your HTML/XML.
- DOMDocument supports namespaces.
So What Exactly Does SmartDOMDocument Do Then?
DOMDocument by itself is good but has a few annoyances, which SmartDOMDocument tries to correct. Here are some things it does:
saveHTMLExact()
DOMDocument has an extremely badly designed "feature" where if the HTML code you are loading does not contain <html> and <body> tags, it adds them automatically (yup, there are no flags to turn this behavior off).
Thus, when you call $doc->saveHTML(), your newly saved content now has <html><body> and DOCTYPE in it. Not very handy when trying to work with code fragments (XML has a similar problem).
SmartDOMDocument contains a new function called saveHTMLExact() which does exactly what you would want – it saves HTML without adding that extra garbage that DOMDocument does.
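To make the idea concrete, here is a minimal sketch of one way to get fragment-only output from a plain DOMDocument. It is not SmartDOMDocument's actual implementation: the helper name save_fragment_html() is made up for illustration, and it assumes PHP 5.3.6 or newer, where saveHTML() accepts a node argument.

```php
<?php
// Hypothetical helper (not part of SmartDOMDocument): serialize only the
// children of the auto-generated <body>, skipping DOCTYPE/<html>/<body>.
function save_fragment_html(DOMDocument $doc)
{
    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        // No <body> was generated (for example, nothing was loaded yet);
        // fall back to the regular output.
        return trim((string) $doc->saveHTML());
    }

    $html = '';
    foreach ($body->childNodes as $child) {
        // saveHTML($node) serializes a single node (PHP 5.3.6+).
        $html .= $doc->saveHTML($child);
    }
    return trim($html);
}

$doc = new DOMDocument();
$doc->loadHTML("<div class='class1'><p>Some Text</p></div>");
echo save_fragment_html($doc), "\n";
```

The output contains the <div> and its contents but no DOCTYPE, <html>, or <body> wrapper.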
Encoding Fix
DOMDocument notoriously doesn't handle encoding (at least UTF-8) correctly and garbles the output.
SmartDOMDocument tries to work around this problem by enhancing loadHTML() to deal with encoding correctly. This behavior is transparent to you – just use loadHTML() as you would normally.
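For illustration, here is a sketch of one common workaround for the encoding problem. It may or may not match what SmartDOMDocument does internally. The helper name load_utf8_html() is hypothetical, and the sketch assumes the mbstring extension is available and that the input is UTF-8.

```php
<?php
// Hypothetical helper (not SmartDOMDocument's actual code): turn every
// non-ASCII character into a numeric entity before parsing, so libxml
// never has to guess the input encoding.
function load_utf8_html(DOMDocument $doc, $html)
{
    // Map format: start code point, end code point, offset, mask.
    $map = array(0x80, 0x10FFFF, 0, 0x1FFFFF);
    $ascii_html = mb_encode_numericentity($html, $map, 'UTF-8');

    // @ suppresses warnings about markup that is not well formed.
    return @$doc->loadHTML($ascii_html);
}

$doc = new DOMDocument();
load_utf8_html($doc, '<p>русский</p>');

// DOM strings come back as UTF-8, so the text survives the round trip.
echo $doc->getElementsByTagName('p')->item(0)->textContent, "\n";
```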
SmartDOMDocument Object As String
You can use a SmartDOMDocument object as a string which will print out its contents.
For example:
echo "Here is the HTML: $smart_dom_doc";
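This kind of behavior is typically built on PHP's __toString() magic method. The sketch below shows the idea with a made-up class name, PrintableDOMDocument. It is an assumption about the general approach and not a copy of the real class, which would presumably return the saveHTMLExact() output and not the full document.

```php
<?php
// Hypothetical subclass for illustration only.
class PrintableDOMDocument extends DOMDocument
{
    // Called automatically whenever the object is used as a string.
    public function __toString()
    {
        // saveHTML() can return false on failure, so cast to string.
        return (string) $this->saveHTML();
    }
}

$doc = new PrintableDOMDocument();
$doc->loadHTML('<p>Hello</p>');
echo "Here is the HTML: $doc";
```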
Example
This example loads sample HTML using SmartDOMDocument, uses getElementsByTagName() to find the first <img> tag and removeChild() to remove it, then prints the HTML before and after the removal along with the HTML of the removed image.
$content = <<<CONTENT
<div class='class1'>
<img src='http://www.google.com/favicon.ico' />
Some Text
<p>русский</p>
</div>
CONTENT;

print "Before removing the image, the content is: " . htmlspecialchars($content) . "<br>";

$content_doc = new SmartDOMDocument();
$content_doc->loadHTML($content);

$image = ''; // stays empty if the content has no <img> tag
try {
    $first_image = $content_doc->getElementsByTagName("img")->item(0);
    if ($first_image) {
        // Detach the image from the main document and re-save the content.
        $first_image->parentNode->removeChild($first_image);
        $content = $content_doc->saveHTMLExact();

        // Import the detached node into its own document to get its HTML.
        $image_doc = new SmartDOMDocument();
        $image_doc->appendChild($image_doc->importNode($first_image, true));
        $image = $image_doc->saveHTMLExact();
    }
} catch (Exception $e) {
    // Errors are deliberately ignored in this example.
}

print "After removing the image, the content is: " . htmlspecialchars($content) . "<br>";
print "The image is: " . htmlspecialchars($image);
Requirements And Prerequisites
- PHP 5.2+ is no longer a requirement – any version of PHP 5 that has DOMDocument should work now.
- DOMDocument – this should be a built-in class, but I've seen instances of it missing for some reason. My guess is that 99.9% of the time you will already have it.
Sounds Great – Where Do I Get It?
Download
http://svn.beerpla.net/repos/public/PHP/SmartDOMDocument/trunk/SmartDOMDocument.class.php
Check out from SVN
svn co http://svn.beerpla.net/repos/public/PHP/SmartDOMDocument/trunk SmartDOMDocument
I highly recommend using SVN (Subversion) because you can easily update to the latest version by running svn up.
Use as "svn:externals"
If you have an existing project in SVN and you would like to use SmartDOMDocument, you can set up this library as svn:externals.
svn:externals is kind of like a symlink to another repository from your existing SVN project. That way, you can still benefit from using SVN commands such as svn up without having to maintain a local copy of the external code.
You can read more about setting svn:externals here.
Here's how you would do this:
cd YOUR_PROJ_DIR
svn propset svn:externals 'SmartDOMDocument http://svn.beerpla.net/repos/public/PHP/SmartDOMDocument/trunk' .
svn ci .
svn up
Version History
0.4
- No longer using C14N() because it is causing bugs (such as <br> turning into <br></br>)
- PHP 5.2+ is no longer a requirement, due to the change above.
0.3.2
- test/example function added
0.3.1
- suppress warnings when loading HTML by default (this may change to use a setting later). This gets rid of "empty content", "unexpected tag", and other not well formed HTML warnings.
- add the standard trunk/tags/branches layout to the SVN repository.
0.3
- use a better, more portable method of dealing with encodings properly (thanks piopier).
0.2
- use the undocumented DOMDocument->C14N() if it's available (PHP 5.2+) to save exact HTML (really, PHP? We don't document extremely useful functions anymore?).
0.1
- initial release.
References
How To Report Bugs
You have a few options here:
- Leave a comment here.
- Use the contact form on the About page.
- Send an email to admin [{at}] this site's domain.
- IM me using the Digsby widget on the right.
beer planet is a blog about technology, programming, computers, and geek life. It is run by Artem Russakovskii - a local San Francisco geek who currently works at
Hey man,
Thanks first off for sharing this!
I found a glitch when you try using the saveHTMLExact() method with <br /> tags in the SmartDOMDocument: the <br /> tags get converted to <br></br> instead.
Also, if using SmartDOMDocument to convert code in between <pre> tags (such as code pastes from a form textarea) with htmlentities() and the code between the <pre> tags has a doctype in it, it gets stripped as well with saveHTMLExact().
Thanks again.
Thanks for the report, David.
I actually just experienced the same problem in my own project and saw what you're talking about.
Just fixed it in 0.4 – it's no longer using the C14N() function that was causing this, which also removes the PHP 5.2 requirement.
Give that one a go?
Hey man,
I was doing some more tinkering with DOMDocument and generating XHTML content and I figured a few things out.
First off, there's an excellent tutorial on making complete XHTML documents with DOMDocument at http://www.ultramegatech.com/blog/2009/07/generating-xhtml-documents-using-domdocument-in-php/ and it was from that I discovered this:
saveHTML doesn't output valid XHTML tags, but saveXML does. Never thought of that, but it fixes the issue with the tags <br /> being turned into <br></br> for one. In regards to that, check out LIBXML_NOEMPTYTAG constant @ http://us3.php.net/manual/en/libxml.constants.php …, look familiar?
I am curious how difficult it would be to implement a saveXMLExact() method in your SmartDOMDocument class, or whether it would need to be a new class. Unfortunately, saveXML() adds the <?xml ?> declaration to the top, and that's a major inconvenience when using it to generate only a certain part of a document. I was looking at using it for a form generator/validation class I had written to generate the whole form, but that declaration is a problem.
Here's my example: http://pastebin.com/HhXMJnDm
Let me know what you think and whatever you come up with.
Thanks again.
You can't even imagine how grateful I am to you. Thank you so much, brother.
I wish to try it but your link seems to be broken….
yes – the link is broken. can you please fix it?
thanks
http and svn download not working… very interested to try your code ;D
Same problem here.
Guys, while I'm fixing the SVN server, I put a copy of the code here: http://tinypaste.com/8911ac
I'll try to fix it asap but I am currently working through recompiling svn trunk due to a bug I need fixed which is not backported to any released version.
Artem, this is pretty smart, thanks for it.
I've been looking into making a DOMDocument/DOMElement implementation ON STEROIDS and this gave me the bit of inspiration I needed. I'm now using SmartDOMDocument in conjunction with a new class, SmartDOMElement, (which is registered into SmartDOMDocument in __construct via $this->registerNodeClass('DOMElement', 'SmartDOMElement'); ) – both will have enhanced functionality over their respective base classes.
So: Thanks!
What I get for the Example is:
After removing the image, the content is: <div class="class1">
Some Text
<p>& #1088;& #1091;& #1089;& #1089;& #1082;& #1080;& #1081;</p>
</div>
The image is: <img src="http://www.google.com/favicon.ico">
Is this correct? All Russian characters are encoded.
PS. This comment editor is really bad for entering HTML code.
You might want to change the regex a bit. Current one doesn't really work for me.
preg_replace(array('#^<\!DOCTYPE.*?.*?.*?#is', '#.*?$#si'), ", trim($this->saveHTML()));
Last one messed up. Let's try again.
preg_replace(array('#^<\!DOCTYPE.*?.*?.*?#is', '#.*?$#si'), '', trim($this->saveHTML()));
thanks so much it really helped me
Thanx – I was stuck with improper UTF for ages …
Hey, does this happen to fix the issue where DOMDocument will fix a broken html page (IE unclosed tags, the like)?
I really like what you have done here.
I personally wrote a NewDOMElement class that helps me create markup quickly (http://code.google.com/p/chaton-cms/source/browse/trunk/classes/DOMElement.php).
I think I will use your class in my future projects.
Thanks a lot man, I had bad UTF-8 encoding, now the display is very good and all I had to modify in my code after including your PHP class file was "new DOMDocument()" turned into "new SmartDOMDocument()".