Hacking Hreflang Sitemaps for Special Character Solutions

This post helps you use our Hreflang sitemap tool to create Hreflang XML sitemaps that include URLs with non-Latin characters, as found in languages such as Chinese or Arabic. This content is aimed at technical practitioners who already have experience in using Hreflang sitemap solutions.

Currently there’s a limitation with Hreflang XML sitemaps, and therefore with our Hreflang sitemap tool, which relates to the character encoding of the URLs to be mapped. According to Google on non-alphanumeric and non-Latin characters in sitemaps:

“A sitemap can contain only ASCII characters; it can't contain upper ASCII characters or certain control codes or special characters such as * and {}. If your Sitemap URL contains these characters, you'll receive an error when you try to add it.”

We know this has been a frustration for some of our community working on very large global suites, particularly in Arabic and Asian languages. So here’s the workaround:

We need to get the percent-encoded version of the original URL that contains the special characters. In this encoded URL, each special character is replaced by the escape codes for its UTF-8 bytes (for example, é becomes %C3%A9). For a small implementation you can get this from Screaming Frog, as shown below. There’s no export feature for encoded URLs, however, which is why I suggest this only for smaller implementations, where a copy-and-paste job is practical.

Screaming Frog showing Chinese characters in URLs
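If you would rather produce the encoded form yourself than copy it from a crawl, the sketch below shows one way to percent-encode a URL with Python's standard library. The function name `percent_encode_url` and the example.com URL are illustrative only, and the sketch assumes the hostname is already ASCII (it does not convert internationalised domain names).

```python
from urllib.parse import quote, urlsplit, urlunsplit


def percent_encode_url(url: str) -> str:
    """Percent-encode non-ASCII characters in the path, query and fragment.

    Assumption: the hostname is already ASCII, so it is left untouched.
    Existing %XX escapes are preserved because "%" is listed as safe.
    """
    parts = urlsplit(url)
    return urlunsplit((
        parts.scheme,
        parts.netloc,                         # not converted (assumed ASCII)
        quote(parts.path, safe="/%"),         # keep path separators
        quote(parts.query, safe="=&%+"),      # keep query delimiters
        quote(parts.fragment, safe="%"),
    ))


# Illustrative URL only; quote() encodes as UTF-8 by default.
print(percent_encode_url("https://example.com/产品/手机"))
```

For instance, `percent_encode_url("https://example.com/café")` returns `https://example.com/caf%C3%A9`, which contains only ASCII characters.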

You can then ‘hand-stitch’ these encoded URLs into their respective places in your CSV input file and plug it into our Hreflang tool, to generate your XML sitemaps as per normal.
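For more than a handful of rows, the hand-stitching step could be scripted. The following is a rough sketch rather than part of our tool: it assumes a hypothetical CSV layout in which any cell starting with http:// or https:// is a URL, and it uses a single `quote` call that leaves URL-reserved characters alone. As a shortcut it also assumes hostnames are ASCII.

```python
import csv
import io
from urllib.parse import quote

# Reserved URL characters (plus "%") that must pass through unchanged.
URL_SAFE = ":/?#[]@!$&'()*+,;=%"


def encode_csv_urls(src, dst):
    """Copy CSV rows from src to dst, percent-encoding every URL cell.

    Assumption: a cell is a URL if it starts with http:// or https://.
    """
    writer = csv.writer(dst, lineterminator="\n")
    for row in csv.reader(src):
        writer.writerow([
            quote(cell, safe=URL_SAFE)
            if cell.startswith(("http://", "https://"))
            else cell
            for cell in row
        ])


# In-memory demonstration; the column names here are hypothetical.
source = io.StringIO("lang,url\nfr,https://example.com/café\n")
target = io.StringIO()
encode_csv_urls(source, target)
print(target.getvalue())
```

With real files you would pass handles opened with `newline=""` and `encoding="utf-8"` in place of the `StringIO` objects.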

If you’re working on a larger implementation and your site uses the Open Graph protocol, then there is another way to get these URLs out of Screaming Frog in bulk, which is by using a custom extraction.

First check to see if the following meta tag is present…

<meta property="og:url" content="encoded URL here" />

If so, you can use the following line of custom extraction with XPath:

//meta[starts-with(@property, 'og:url')][1]/@content

Set it to "Extract HTML Element" and Screaming Frog adds another column to your crawl data when you export it. Alternatively, there are a couple of tools out there that can encode and decode URLs, including this one and that one.
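To spot-check what that extraction should return for a single page, here is a minimal sketch that pulls the first og:url value out of an HTML string using only Python's standard library. It is not how Screaming Frog works internally; the class and function names are hypothetical, and the matching rule simply mirrors the starts-with test in the XPath above.

```python
from html.parser import HTMLParser


class OgUrlParser(HTMLParser):
    """Remember the content of the first <meta property="og:url..."> tag."""

    def __init__(self):
        super().__init__()
        self.og_url = None

    def handle_starttag(self, tag, attrs):
        # Self-closing <meta ... /> tags are also routed through here.
        if tag != "meta" or self.og_url is not None:
            return
        attributes = dict(attrs)
        if (attributes.get("property") or "").startswith("og:url"):
            self.og_url = attributes.get("content")


def extract_og_url(html: str):
    """Return the first og:url content value, or None if there is none."""
    parser = OgUrlParser()
    parser.feed(html)
    return parser.og_url


# Illustrative markup only.
page = '<head><meta property="og:url" content="https://example.com/caf%C3%A9" /></head>'
print(extract_og_url(page))
```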

Whilst not the most elegant of solutions, this will work to help you create a full set of sitemaps for the non-Latin URLs occasionally found in large global site set-ups.

Special thanks to Pete for perseverance on finding this workaround!

Mar 03 2016
Nichola Stott