Enhancing your submission to Google by creating a Google Sitemap.
Google has a new technology that lets you create a complete sitemap of your site in XML using a simple Python script. The technical discussion at Google is very good, but I thought I would explain the simple problems that you may run into.
A good, simple example to back up their excellent tutorial seemed like it would be useful.
Firstly, you need a web site, and access to it, to even bother contemplating a sitemap. With the obvious bit out of the way, your web host also needs to have Python installed, and finally you must be allowed to run a cron job: a job that runs automatically at set intervals. More about this in a second.
OK, you will need to visit the Google Sitemaps page (or sourceforge.net) for the sitemap generator files. I could add them to this site, but this way you know where the code is coming from.
Download the .zip version if you have Windows, or the .gz version for Linux. I will only talk about Windows, but the steps are similar for both.
Unzip this to an area on your PC, e.g. to the C: drive, and you will have a folder containing these files:
C:\sitemap_gen-1.0\
    sitemap_gen.py (the Python script)
    example_config.xml (a read-only file)
These are the only two files you need bother yourself with.
This is all you need to do:
Save the example_config.xml as config.xml - this makes it a writeable file
Open config.xml in a suitable text editor. I use TextPad from textpad.com - it is free to evaluate for as long as you wish, and is inexpensive to purchase. You can also double-click the same file to open it in your browser window, where it is easy to see which parts are comments and which are relevant. Try it.
I cut my file down to the bare basics for this example as config.xml :
<?xml version="1.0" encoding="UTF-8"?>
<!-- ** MODIFY ** The "site" node describes your basic web site. -->
<site base_url="http://dirtiboy.com/" store_into="sitemap.xml" verbose="1" >
  <!-- *** FILTERS *** -->
  <!-- Exclude URLs that point to UNIX-style hidden files -->
  <filter action="drop" type="regexp" pattern="/\.[^/]*$" />
  <!-- Exclude URLs that end with a '~' (IE: emacs backup files) -->
  <filter action="drop" type="wildcard" pattern="*~" />
  <!-- Exclude URLs that point to default index.html files. URLs for directories get included, so these files are redundant. -->
  <filter action="drop" type="wildcard" pattern="*index.htm*" />
</site>
All you need do is cut and paste this example into your text editor, save it as config.xml and edit the file to change dirtiboy.com to the name of your site. That's all there is to creating this file.
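Before uploading, it can be worth checking that your edited config.xml is still well-formed XML, since a stray quote or missing bracket is an easy mistake to make. The following is only a rough sketch using Python's standard xml.etree.ElementTree module; the function name summarize_config and the embedded sample text are my own illustration, not part of the Google script.

```python
import xml.etree.ElementTree as ET

# Sample text mirroring the trimmed config.xml above (comments left out).
CONFIG_XML = r"""<?xml version="1.0" encoding="UTF-8"?>
<site base_url="http://dirtiboy.com/" store_into="sitemap.xml" verbose="1">
  <filter action="drop" type="regexp" pattern="/\.[^/]*$" />
  <filter action="drop" type="wildcard" pattern="*~" />
  <filter action="drop" type="wildcard" pattern="*index.htm*" />
</site>
"""


def summarize_config(xml_text):
    """Parse config text and return the site settings and its filters.

    Raises xml.etree.ElementTree.ParseError if the text is not well-formed.
    Hypothetical helper for illustration only.
    """
    root = ET.fromstring(xml_text.encode("utf-8"))
    if root.tag != "site":
        raise ValueError("expected a <site> root element, got <%s>" % root.tag)
    filters = [
        (f.get("action"), f.get("type"), f.get("pattern"))
        for f in root.findall("filter")
    ]
    return {
        "base_url": root.get("base_url"),
        "store_into": root.get("store_into"),
        "filters": filters,
    }


if __name__ == "__main__":
    summary = summarize_config(CONFIG_XML)
    print("Site:", summary["base_url"])
    print("Output file:", summary["store_into"])
    for action, kind, pattern in summary["filters"]:
        print("Filter:", action, kind, pattern)
```

To check your real file, read its contents and pass them to summarize_config in place of the sample text. If the file is broken you get a parse error pointing at the offending line.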
The next step is to upload these two files (sitemap_gen.py and config.xml) to the root of your site, i.e. the directory above public_html, NOT into the public_html folder itself (also known as www).
Then lastly we need a cron job set up. Have a quick search of the support pages for your site to confirm where the Python executable lives.
Mine is at /usr/bin/python, which is worth a try if you are not sure. Add a cron job with the following line:
/usr/bin/python sitemap_gen.py --config=config.xml --testing
Once it works, remove the --testing part of this line.
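If your host lets you edit a raw crontab rather than filling in a form, the command above needs five schedule fields in front of it (minute, hour, day of month, month, day of week). Here is a small sketch that builds such a line; the helper name cron_line and the 03:30 daily schedule are assumptions of mine, so pick whatever time suits your site.

```python
def cron_line(minute, hour, python_path="/usr/bin/python", testing=True):
    """Build a crontab entry that runs sitemap_gen.py once a day.

    Hypothetical helper for illustration; the command itself is the one
    shown above, and only the schedule fields are added in front of it.
    """
    if not (0 <= minute <= 59 and 0 <= hour <= 23):
        raise ValueError("minute must be 0-59 and hour must be 0-23")
    command = "%s sitemap_gen.py --config=config.xml" % python_path
    if testing:
        # Keep --testing until the job is known to work, then drop it.
        command += " --testing"
    # "* * *" means every day of the month, every month, every weekday.
    return "%d %d * * * %s" % (minute, hour, command)


if __name__ == "__main__":
    print(cron_line(30, 3))                 # first trial run, with --testing
    print(cron_line(30, 3, testing=False))  # once everything works
```

If your control panel has separate boxes for the schedule and the command, you only need the command part and can set the time in the form instead.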
If you are unfortunate enough to use a crappy web site provider, e.g. ipowerweb.com, then move to someone better. They don't run cron jobs, although it is part of their package.
*nb* One tiny addition, which I have not thoroughly tested, deals with a problem you will hit if you use FrontPage as your HTML editor: lots of files with _vti_ in their names get included in sitemap.xml.
A quick solution is to add a couple of lines to the filters section of config.xml, like this (the first is a comment):
<!-- Exclude _vti_ files from frontpage -->
<filter action="drop" type="wildcard" pattern="*_vti_*" />
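To get a feel for what these drop filters will catch before you run the real job, you can try the patterns against a few URLs yourself. This sketch uses Python's standard fnmatch and re modules as a stand-in for the script's wildcard and regexp matching; I am assuming the behaviour is close enough for a quick check, and the is_dropped helper and sample URLs are made up for illustration.

```python
import fnmatch
import re

# The drop filters from config.xml, including the FrontPage one.
DROP_FILTERS = [
    ("regexp", r"/\.[^/]*$"),
    ("wildcard", "*~"),
    ("wildcard", "*index.htm*"),
    ("wildcard", "*_vti_*"),
]


def is_dropped(url):
    """Return True if any drop filter matches the URL (illustrative only)."""
    for kind, pattern in DROP_FILTERS:
        if kind == "wildcard" and fnmatch.fnmatchcase(url, pattern):
            return True
        if kind == "regexp" and re.search(pattern, url):
            return True
    return False


if __name__ == "__main__":
    samples = [
        "http://dirtiboy.com/about.htm",
        "http://dirtiboy.com/index.html",
        "http://dirtiboy.com/_vti_cnf/about.htm",
        "http://dirtiboy.com/.htaccess",
    ]
    for url in samples:
        print(url, "-> dropped" if is_dropped(url) else "-> kept")
```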
Oh, yes, when you set up a cron job (from the cPanel of your web site) remember to add an email address; you then get a confirmation sent to you every time the job runs.
I recommend lunarpages.com as an excellent service provider.
Everything works, and they are contactable, and very friendly.
I can be contacted at kevan if you hit problems; or post a blog