Web spam and Unicode private use

I just cleaned up a piece of web spam that contained only some gibberish and a funny-looking character. The character () was from the Unicode private use area. Interesting. I wonder if this is an attempt to find sites that are vulnerable. Warning! Speculation ahead.

I’m guessing that the tactic is to spam a load of forms and then just wait. The sites where the post gets published will likely get indexed by search engines. After a while, searching for the character will turn up those sites. Neatly sorted by influence (PageRank) too. Then it’s just a matter of spamming them with the real payload.

Neither Google nor Bing finds any results for the character at this time. I wonder if that’s because it’s unique at the time of indexing, or if it’s because they don’t allow searching for private use characters. I guess we’ll know when this post gets indexed. (Good thing I don’t allow comments!)

In any case, this kind of usage is definitely “private use”.
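For the curious, here is a minimal sketch of how one might check whether a piece of text contains private use characters. It relies on Python's standard `unicodedata` module, which reports the general category `Co` for private use code points. The function name `private_use_chars` and the sample string are made up for illustration, and U+E000 is only a stand-in, not necessarily the character from the spam.

```python
import unicodedata


def private_use_chars(text):
    """Return the characters in text whose general category is Co (private use)."""
    return [ch for ch in text if unicodedata.category(ch) == "Co"]


# U+E000 is a placeholder here; the actual character in the spam may differ.
sample = "gibberish \ue000 more gibberish"
found = private_use_chars(sample)
print([f"U+{ord(ch):04X}" for ch in found])
```

A comment form could use a check like this to flag suspicious submissions, though that is a guess at a use, not something tested against real spam.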

  1. Quick update:

    Both Bing and Google have indexed this blog post. Both search engines still don’t return any hits for the funny character. So it would seem they block searches for such characters.

    Interestingly, Bing strips the character from the summary as well. Google does show it in the summary.



