Code.Art.Web


Lorefnon

Decrypting an obfuscated font with OCR

Abstract

A small demonstration showing that attempts to protect against scraping with obfuscated fonts are not foolproof.


I recently came across this site which demonstrates a strategy for copy protection using an obfuscated font. If you copy the text in the header, you will find that the copied text is not what is visible on the screen. The text is rendered through a font that maps each character to the glyph of a different character, so although the output is human readable, the underlying text is not. This is presented as a means of protecting the site from crawlers and scrapers. In this post I intend to demonstrate that it is not as foolproof as non-technical folks might be led to believe.

The Achilles heel of such strategies is that the unencrypted text is visible to the user, so a decent OCR engine can read the rendered output and recover the text with reasonable accuracy. The technique certainly presents hindrances to a scraper, but it is not a foolproof solution. The simple code snippets below show how this can be accomplished in Ruby.

We could simply take a screenshot of the entire site using a tool like PhantomJS and pass it to the OCR engine, but the approach below is much more accurate.

First of all download the font used for the text. The path should be clearly visible in the source of the site and can be scraped using conventional text parsing through regular expressions.
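As a rough illustration of the regular-expression approach, the sketch below pulls font file URLs out of `url(...)` references in a page's HTML or CSS. The `FONT_URL` pattern and the `font_urls` helper are hypothetical names introduced here, and the pattern assumes the font is referenced with a common extension such as `.ttf` or `.woff`; a real site may need a different pattern.

```ruby
# Hypothetical helper, not part of any gem: find font files referenced
# through url(...) in a chunk of HTML or CSS source.
# Assumption: the font is served with one of the extensions listed below.
FONT_URL = /url\(\s*['"]?([^'")\s]+\.(?:ttf|otf|woff2?|eot)(?:[?#][^'")\s]*)?)['"]?\s*\)/i

def font_urls(source)
  # scan returns one array per match holding the single capture group,
  # so flatten gives a plain list of URLs; uniq drops repeated references.
  source.scan(FONT_URL).flatten.uniq
end
```

Calling `font_urls` on the page source (fetched with any HTTP client) returns the candidate paths, which can then be downloaded to a local directory such as `/tmp`.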

Alternatively, it can be downloaded manually through the browser's devtools.

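Now we need two gems: magick_title, which renders a string to an image using a given font, and tesseract-ocr, which provides Ruby bindings for the Tesseract OCR engine.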

Quoting the Tesseract OCR Website:

Tesseract is probably the most accurate open source OCR engine available. Combined with the Leptonica Image Processing Library it can read a wide variety of image formats and convert them to text in over 60 languages. It was one of the top 3 engines in the 1995 UNLV Accuracy test. Between 1995 and 2006 it had little work done on it, but since then it has been improved extensively by Google. It is released under the Apache License 2.0.

Installing Tesseract on OS X is as simple as brew install tesseract, and it is also available through several Linux package managers.

The ruby gems can be installed by adding the following to your Gemfile:

gem 'magick_title', '>= 0.2.0'
gem 'tesseract-ocr'

Now in irb (or a ruby script) you can do:

require 'magick_title'
MagickTitle.options[:font] = "font_dfcb813d6c003fb3e2fca9f5295e9f58.ttf"  # Font downloaded from the site
MagickTitle.options[:font_path] =  Proc.new{ '/tmp' }  # Directory where the font resides on our system
MagickTitle.options[:destination] =  Proc.new{ MagickTitle.root } # Defaults to current directory

Now let us fetch the string from the site header. For the sake of keeping the example minimal we omit the actual scraping code here:

str =  "1zb SN358 y6JBl HJL 7Nagq JRb6 kzb n2vP 9JV. F + h = t."

MagickTitle.say(str) will give us an image containing the text in exactly the same form as is visible to the end user.

Now we pass the image to the Tesseract engine:

require 'tesseract'

e = Tesseract::Engine.new {|e|
    e.language  = :eng    # Recognize English text
    e.blacklist = '|'     # Never emit the pipe character
}

We can get the string using:

> e.text_for("./#{MagickTitle.say(str).filename}")
=> "The quick brown fox jumps over\nthe lazy dog 1 2 3\n\n"

As you can see, the result is not perfect (the punctuation and the symbols between the digits are missing) but it is quite accurate.

To further improve the performance of the solution, we can create an image of every character through MagickTitle and pass it to Tesseract to generate a character-by-character map. Large chunks of text can then be translated using this map much more efficiently.
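A minimal sketch of the character-map idea follows. The helper names `build_char_map` and `decode` are hypothetical, and the recognizer is passed in as a callable so that the mapping logic stays independent of the rendering and OCR gems. OCR on a single glyph can be less reliable than on a full sentence, so this sketch keeps only results that are exactly one character long and leaves everything else untranslated.

```ruby
# Hypothetical helpers illustrating the character-map approach.

# Build an obfuscated -> visible lookup table by "reading" each character once.
# `recognize` is any callable that takes a one-character string and returns
# the text that is visible for it (for example: render it, then OCR the image).
def build_char_map(chars, recognize)
  chars.uniq.each_with_object({}) do |ch, map|
    seen = recognize.call(ch).to_s.strip
    # Keep only unambiguous single-character results; skip garbled output.
    map[ch] = seen if seen.length == 1
  end
end

# Translate text with the map. Characters missing from the map
# (spaces, unrecognized symbols) pass through unchanged.
def decode(text, map)
  text.each_char.map { |ch| map.fetch(ch, ch) }.join
end

# Possible wiring with the engine `e` and the MagickTitle setup shown earlier
# (an assumption, left commented out because it needs both gems installed):
#
#   recognize = ->(ch) { e.text_for("./#{MagickTitle.say(ch).filename}") }
#   map       = build_char_map(str.delete(' ').chars, recognize)
#   puts decode(str, map)
```

Once the map is built, decoding is a plain hash lookup per character, so OCR runs once per distinct character instead of once per string.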

Lorefnon

Full stack web developer and polyglot programmer with strong interest in dynamic languages, web application development and user experience design.


Strong believer in agile methodologies, behaviour driven development and efficacy of open source technologies.


© 2013 - 2015 Gaurab Paul


Code licensed under the MIT License. Content and Artwork licensed under CC BY-NC-SA.


The opinions expressed herein are my personal viewpoints and may not be taken as professional recommendations from any of my previous or current employers.


Site is powered by Jekyll and graciously hosted by Github