<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Muziboo Development Blog &#187; gotcha</title>
	<atom:link href="http://devblog.muziboo.com/category/gotcha/feed/" rel="self" type="application/rss+xml" />
	<link>http://devblog.muziboo.com</link>
	<description>Muziboo development stories. Mostly set in ruby land</description>
	<lastBuildDate>Fri, 01 Apr 2011 04:52:32 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.4</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Attachment_fu sanitize filename, Regex and Unicode gotcha</title>
		<link>http://devblog.muziboo.com/2008/06/17/attachment-fu-sanitize-filename-regex-and-unicode-gotcha/</link>
		<comments>http://devblog.muziboo.com/2008/06/17/attachment-fu-sanitize-filename-regex-and-unicode-gotcha/#comments</comments>
		<pubDate>Tue, 17 Jun 2008 09:19:17 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[attachment_fu]]></category>
		<category><![CDATA[gotcha]]></category>
		<category><![CDATA[ruby on rails]]></category>
		<category><![CDATA[sanitize filename]]></category>
		<category><![CDATA[unicode]]></category>

		<guid isPermaLink="false">http://prateekdayal.net/tech/?p=14</guid>
		<description><![CDATA[
			
				
			
		
Attachment_fu sanitizes uploaded filenames by replacing any funky character (anything other than 0-9, a-z, A-Z, underscore, period or hyphen). This is done by the private sanitize_filename method in the attachment_fu.rb file:

def sanitize_filename&#40;filename&#41;
  returning filename.strip do &#124;name&#124;
    # NOTE: File.basename doesn't work right with Windows paths on Unix
    # get [...]]]></description>
			<content:encoded><![CDATA[
<p>Attachment_fu sanitizes uploaded filenames by replacing any funky character (anything other than 0-9, a-z, A-Z, underscore, period or hyphen). This is done by the private <em>sanitize_filename</em> method in the <em>attachment_fu.rb</em> file:</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="ruby" style="font-family:monospace;"><span style="color:#9966CC; font-weight:bold;">def</span> sanitize_filename<span style="color:#006600; font-weight:bold;">&#40;</span>filename<span style="color:#006600; font-weight:bold;">&#41;</span>
  returning filename.<span style="color:#9900CC;">strip</span> <span style="color:#9966CC; font-weight:bold;">do</span> <span style="color:#006600; font-weight:bold;">|</span>name<span style="color:#006600; font-weight:bold;">|</span>
    <span style="color:#008000; font-style:italic;"># NOTE: File.basename doesn't work right with Windows paths on Unix</span>
    <span style="color:#008000; font-style:italic;"># get only the filename, not the whole path</span>
    name.<span style="color:#CC0066; font-weight:bold;">gsub!</span> <span style="color:#006600; font-weight:bold;">/</span>^.<span style="color:#006600; font-weight:bold;">*</span><span style="color:#006600; font-weight:bold;">&#40;</span>\\<span style="color:#006600; font-weight:bold;">|</span>\<span style="color:#006600; font-weight:bold;">/</span><span style="color:#006600; font-weight:bold;">&#41;</span><span style="color:#006600; font-weight:bold;">/</span>, <span style="color:#996600;">''</span>
&nbsp;
    <span style="color:#008000; font-style:italic;"># Finally, replace all non alphanumeric, underscore or periods with underscore</span>
    name.<span style="color:#CC0066; font-weight:bold;">gsub!</span> <span style="color:#006600; font-weight:bold;">/</span><span style="color:#006600; font-weight:bold;">&#91;</span>^\w\.\<span style="color:#006600; font-weight:bold;">-</span><span style="color:#006600; font-weight:bold;">&#93;</span><span style="color:#006600; font-weight:bold;">/</span>, <span style="color:#996600;">'_'</span>
  <span style="color:#9966CC; font-weight:bold;">end</span>
<span style="color:#9966CC; font-weight:bold;">end</span></pre></td></tr></table></div>
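
<p>To see what the two substitutions do, here is a minimal standalone sketch that runs without Rails. The method name <em>sanitize_filename_sketch</em> and the sample paths are made up for illustration, and <em>returning</em> is swapped for plain Ruby, so this is not the plugin&#8217;s own code.</p>

```ruby
# Hypothetical standalone version of the two substitutions above
# (plain Ruby, no Rails `returning` helper).
def sanitize_filename_sketch(filename)
  name = filename.strip
  # Drop everything up to the last backslash or forward slash, so both
  # Windows and Unix paths are reduced to the bare filename.
  name = name.gsub(/^.*(\\|\/)/, '')
  # Replace anything that is not a word character, period or hyphen.
  name.gsub(/[^\w.\-]/, '_')
end

puts sanitize_filename_sketch('C:\\Users\\me\\My Song (live).mp3')
puts sanitize_filename_sketch('/tmp/uploads/demo-track_01.mp3')
```

<p>The first call prints My_Song__live_.mp3 and the second leaves demo-track_01.mp3 untouched.</p>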

<p>The shortcut \w is described <a href="http://www.ruby-doc.org/docs/UsersGuide/rg/regexp.html" onclick="javascript:pageTracker._trackPageview('/outbound/article/www.ruby-doc.org');">here</a> as &#8220;letter or digit; same as [0-9A-Za-z]&#8221;. However, when the Ruby regex engine is Unicode-aware, as it is under Rails on Ruby 1.8 (where $KCODE is set to UTF-8), &#8220;letter&#8221; means any Unicode letter, not just an ASCII one. So characters like 爱与希望 are left in place. This can be a problem if you pass a filename containing such characters to a Flash player: the player just won&#8217;t play the file!</p>
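
<p>A small sketch of the difference, assuming a modern Ruby (1.9 or later) with UTF-8 strings. There a bare \w is ASCII-only, so the Unicode-aware matching described above is reproduced here with the POSIX bracket [[:word:]]. The sample filename is made up.</p>

```ruby
# encoding: utf-8
# Assumption: Ruby 1.9+ where plain \w only matches [a-zA-Z0-9_].
name = "爱与希望.mp3"

# Unicode-aware word class: the CJK letters count as word characters.
unicode_word = name.scan(/[[:word:]]/).join

# Explicit ASCII class: only the ASCII letters and digits match.
ascii_word = name.scan(/[0-9A-Za-z_]/).join

puts unicode_word
puts ascii_word
```

<p>The first line printed still contains the Chinese characters; the second is just mp3.</p>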
<p>A quick fix is to list the allowed characters explicitly (0-9, A-Z, a-z) instead of relying on \w. This can be done by changing the function to:</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="ruby" style="font-family:monospace;"><span style="color:#9966CC; font-weight:bold;">def</span> sanitize_filename<span style="color:#006600; font-weight:bold;">&#40;</span>filename<span style="color:#006600; font-weight:bold;">&#41;</span>
  returning filename.<span style="color:#9900CC;">strip</span> <span style="color:#9966CC; font-weight:bold;">do</span> <span style="color:#006600; font-weight:bold;">|</span>name<span style="color:#006600; font-weight:bold;">|</span>
    <span style="color:#008000; font-style:italic;"># NOTE: File.basename doesn't work right with Windows paths on Unix</span>
    <span style="color:#008000; font-style:italic;"># get only the filename, not the whole path</span>
    name.<span style="color:#CC0066; font-weight:bold;">gsub!</span> <span style="color:#006600; font-weight:bold;">/</span>^.<span style="color:#006600; font-weight:bold;">*</span><span style="color:#006600; font-weight:bold;">&#40;</span>\\<span style="color:#006600; font-weight:bold;">|</span>\<span style="color:#006600; font-weight:bold;">/</span><span style="color:#006600; font-weight:bold;">&#41;</span><span style="color:#006600; font-weight:bold;">/</span>, <span style="color:#996600;">''</span>
&nbsp;
    <span style="color:#008000; font-style:italic;"># Originally: replace everything except word characters, periods and hyphens with underscore</span>
    <span style="color:#008000; font-style:italic;">#   name.gsub! /[^\w\.\-]/, '_'</span>
    <span style="color:#008000; font-style:italic;"># Now allow only ASCII letters, digits, periods and hyphens, and replace the rest with x. You don't want all _ :)</span>
    name.<span style="color:#CC0066; font-weight:bold;">gsub!</span><span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#006600; font-weight:bold;">/</span><span style="color:#006600; font-weight:bold;">&#91;</span>^<span style="color:#006666;">0</span><span style="color:#006600; font-weight:bold;">-</span>9A<span style="color:#006600; font-weight:bold;">-</span>Za<span style="color:#006600; font-weight:bold;">-</span>z.\<span style="color:#006600; font-weight:bold;">-</span><span style="color:#006600; font-weight:bold;">&#93;</span><span style="color:#006600; font-weight:bold;">/</span>, <span style="color:#996600;">'x'</span><span style="color:#006600; font-weight:bold;">&#41;</span>
  <span style="color:#9966CC; font-weight:bold;">end</span>
<span style="color:#9966CC; font-weight:bold;">end</span></pre></td></tr></table></div>
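
<p>For a quick check outside Rails, the modified method can be sketched as a standalone function. The name <em>ascii_safe_filename</em> and the sample path are hypothetical; the two regexes are the ones used above.</p>

```ruby
# encoding: utf-8
# Hypothetical standalone version of the modified method (no Rails needed).
def ascii_safe_filename(filename)
  name = filename.strip
  # Keep only the part after the last backslash or forward slash.
  name = name.gsub(/^.*(\\|\/)/, '')
  # Replace everything except ASCII letters, digits, periods and hyphens with x.
  name.gsub(/[^0-9A-Za-z.\-]/, 'x')
end

puts ascii_safe_filename("/uploads/爱与希望 demo_01.mp3")
```

<p>This prints xxxxxdemox01.mp3. Note that this character class turns underscores into x as well; add _ to the class if they should be kept.</p>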

<p>Finally, none of this is a problem if non-ASCII characters in filenames don&#8217;t cause any issues on your site.</p>
]]></content:encoded>
			<wfw:commentRss>http://devblog.muziboo.com/2008/06/17/attachment-fu-sanitize-filename-regex-and-unicode-gotcha/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

