This problem starter as a rather simple one — I needed to send HTML emails, using Perl, to a remote SMTP server with user authentication and TLS support (STARTTLS). The extra “gotcha” is that I refused to read books about MIME structures. Easy enough right? It turns out, it can be if you know the right answer.
Let start with the answer for those that are impatient – it turns out this is the easiest way to achieve this:
# Pull HTML content down from URL
# Create Email itself - hard part, in HTML use Email::Stuffer;
# The "->email" is key because it returns a Email::MIME object which is needed for the send
[updated: Jan 2nd, 2013 | Updated ‘tiny.pl’ to skip “mailto:” references … it polluted emails with replies]
[updated: Nov 28th, 2012 | Updated ‘tiny.pl’ to be much more efficient which resulted in a crazy speedup]
Yesterday I received one too many emails with a long URL that I actually needed to click on. Why is this a problem you wonder? — I use Mutt. Yes, I’ve heard it all, and yes, I don’t think it’s the best email client, but when you spend all day in a terminal, it’s simply a pain launching the browser to send a single email. It’s a “ctrl+a, #, m, type…type, y…sent”, vs “open browser, go to url, log-in, compose, type…type, hit send”. Anyway, given that I am using the Apple Terminal.app which for some reason has not been upgraded in the last 8 years to include hot linking URLs everywhere (correction: It does, but it does not always handle right clicking multi-line links which contain “strange characters and symbols”), I have to suffer. I’ve been toying with the idea of parsing my mutt emails for a while now, and yesterday I finally decided to sit down and write something. My starting point was my .mailcap entry
My first thought was, why not hook a custom perl script to parse the text from the email, extract the urls, and shorten them? After a little bit of work, I realized that I care about the rest of the text too and not just the URLs. The final solution can be found here:
The script starts by NOT reinventing the wheel, and utilizing lynx to parse out the HTML. Please note that this is done only for the ‘text/plain’ type. The way the same script is overloaded is by supplying a second argument for non-html based emails. As you will see, I actually use elinks to parse for html instead of lynx, and the reason for that is because lynx introduces a new character on long URLs for some reason, when used with –dump. This created problems with shortening the URLs. Then the script splits the resulting output into lines, and then it splits each line into “words”. Each “word” is checked for being an URL. If it is, and it’s less than the “trigger” number of characters, it’s shortened and printed, along with the original (nice to keep track). Otherwise, the word is just printed and the process repeats until the end.
While this is EXTREMELY simple concept wise, it is very useful. Is there a down side? — Yes — some potentially private URLs are now public. Solution? — Yes — sign up for bit.ly Pro (free) and use your own domain name. At last, I just want to tack on that while searching for an existing solution to this, I did find a program called “urlview”. I haven’t tried it, but it seems like a much better solution. Here’s some more information on it: http://linuxcommand.org/man_pages/urlview1.html
UPDATE: As it turns out, Terminal.app actually picks up some/most/(all?) long urls. I think it was ‘lynx’ that was wrapping the line at 65-72 characters, which ended up being the cause for the ‘+’ in front of long URLs and the break onto two lines. Anyway, so basically, this means that if you don’t use ‘lynx’ for the HTML parsing, you can potentially click on the links. Either way, I still prefer having tinyurls. Also, I did find a bug in the ‘t’ (non-html emails) version of the script, where in some cases, it will rip out the URL but now show it at all (original or shortened). I noticed this 2 times (out of 1000+ emails), but I just haven’t had time to look into it. I have a feeling that it’s not really my script. The email comes in not containing any html other than an html a href tag. I think that messes with the detection.