Setting up your system for file conversions with File Framework

An important requirement of the platform for SPAWAR (for whom my employer, Openband, works) is full-featured file handling. Our solution is Miglius' File Framework, an exceedingly powerful way to handle files on a Drupal 6 site. It requires Arto's RDF Framework, which makes it easy for other modules to interact with it, and it has a ton of other features, but the one I'm writing about today is file conversion. Out of the box, FF supports tons of conversion paths, so anything your users upload can be downloaded as a preview or as a full file, played in the browser, viewed as HTML, whatever. Importantly, it uses OpenOffice to do all of those nifty office format conversions: viewing Word documents in the browser, inline, is exceedingly useful.

It'd be silly to write that much conversion code yourself, and naturally we didn't want to reinvent the wheel. But that means there's a ton of software to install to make all of this work, from PHP 5.2 to JOD Converter, and that's what this post is about.

Adventures in packet-munging with Telefonica

Good thing that so many ISPs are sullying their names lately, or I might have spent more time on this one.

One of our developers recently moved to Spain to work with some of our contractors for a few months, and he immediately had trouble submitting certain forms on certain pages to a particular machine. It looked like a server issue: both phpMyAdmin and the application using that database were failing, and for everyone at the office. They had a connection with Telefonica, the local Spanish behemoth.

We played with it for a while. We looked at error logs, we tried other forms on other websites (another developer with our contractor could not log in to Rubyforge), and we had some other musings, but none of it went anywhere. I even spent a while messing with bandwidth measurements, since a TCP accelerator problem with one of our satellites in Africa caused very similar issues a few months back. Eventually, I took a tcpdump at the same time he did and got some strange results.

Developer's results:

13:24:33.798068 IP x.x.x.x.local.55864 > y.y.y.y.http: . ack 1 win 65535
13:24:33.798606 IP x.x.x.x.local.55864 > y.y.y.y.http: P 1:678(677) ack 1 win 65535
13:24:34.323942 IP y.y.y.y.http > x.x.x.x.local.55864: . ack 678 win 65535
13:24:34.324056 IP x.x.x.x.local.55864 > y.y.y.y.http: . 678:2138(1460) ack 1 win 65535
13:24:34.493515 IP y.y.y.y.http > x.x.x.x.local.55864: . ack 678 win 6579
13:24:34.957342 IP y.y.y.y.http > x.x.x.x.local.55864: . ack 678 win 6579
13:24:35.312801 IP x.x.x.x.local.55864 > y.y.y.y.http: . 678:2138(1460) ack 1 win 65535
13:24:35.944393 IP y.y.y.y.http > x.x.x.x.local.55864: . ack 678 win 6579
13:24:37.325466 IP x.x.x.x.local.55864 > y.y.y.y.http: . 678:2138(1460) ack 1 win 65535

My results:

13:24:34.039365 IP x.x.x.x.21483 > y.y.y.y.www: . ack 1 win 5840
13:24:34.044362 IP x.x.x.x.21483 > y.y.y.y.www: . 1:732(731) ack 1 win 65535
13:24:34.044378 IP y.y.y.y.www > x.x.x.x.21483: . ack 732 win 6579
13:24:34.494733 IP x.x.x.x.21483 > y.y.y.y.www: . 2126:2192(66) ack 1 win 65535
13:24:34.494745 IP y.y.y.y.www > x.x.x.x.21483: . ack 732 win 6579
13:24:35.494664 IP x.x.x.x.21483 > y.y.y.y.www: . 2126:2192(66) ack 1 win 65535
13:24:35.494670 IP y.y.y.y.www > x.x.x.x.21483: . ack 732 win 6579

Huh! Why are the packet sizes changing by the time they get to me from him? This is where I can thank our friendly ISPs for their incompetence in keeping their packet-adjusting programs transparent: I would never have guessed, even 6 months ago, that ISPs would so blatantly screw with their customers' packets just to have a story to tell at the pub later. Thanks to recent highly publicized displays of unconscionable behavior by ISPs, I needn't convince myself that the rest of the world is wrong and that I have a router problem or something. ISPs really are that bad.
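
The mismatch is easier to see with the timestamps stripped away. Here is a minimal sketch in Python that pulls the `start:end(length)` ranges out of the two captures above and lists the distinct data segments each side saw. The regular expression and the `data_segments` helper are illustrative assumptions, written only against the old-style tcpdump output shown here; other tcpdump versions print sequence numbers differently.

```python
import re

# Matches the "start:end(length)" range in a classic tcpdump TCP line,
# e.g. "...http: P 1:678(677) ack 1 win 65535". Pure ACKs have no range.
SEGMENT = re.compile(r": \S+ (\d+):(\d+)\((\d+)\)")


def data_segments(capture):
    """Return the distinct non-empty (start, end, length) segments, in order seen."""
    seen = []
    for line in capture.splitlines():
        match = SEGMENT.search(line)
        if not match:
            continue  # no sequence range on this line
        segment = tuple(int(group) for group in match.groups())
        # Skip zero-length ranges (SYNs) and retransmissions already recorded.
        if segment[2] > 0 and segment not in seen:
            seen.append(segment)
    return seen


# Data-carrying lines copied from the developer's capture.
SENT = """\
13:24:33.798606 IP x.x.x.x.local.55864 > y.y.y.y.http: P 1:678(677) ack 1 win 65535
13:24:34.324056 IP x.x.x.x.local.55864 > y.y.y.y.http: . 678:2138(1460) ack 1 win 65535
13:24:35.312801 IP x.x.x.x.local.55864 > y.y.y.y.http: . 678:2138(1460) ack 1 win 65535
"""

# The corresponding lines from the capture on the server side.
RECEIVED = """\
13:24:34.044362 IP x.x.x.x.21483 > y.y.y.y.www: . 1:732(731) ack 1 win 65535
13:24:34.494733 IP x.x.x.x.21483 > y.y.y.y.www: . 2126:2192(66) ack 1 win 65535
13:24:35.494664 IP x.x.x.x.21483 > y.y.y.y.www: . 2126:2192(66) ack 1 win 65535
"""

sent = data_segments(SENT)
received = data_segments(RECEIVED)
print("sent:    ", sent)
print("received:", received)
```

In the excerpts shown, the sender's first two segments are 677 and 1460 bytes long, while the receiver sees a 731-byte segment and then a 66-byte one starting at 2126. The stream has been cut at different places, and the stretch in between never turns up in my capture.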

Fortunately, our developers' office was switching from Telefonica to Jazztel the next day, for a bandwidth upgrade. We could see if that fixed the problem.

So what happened the next day?

Developer's results:

11:46:15.199847 IP x.x.x.x.local.63182 > y.y.y.y.http: S 4092983779:4092983779(0) win 65535 <mss 1460,nop,wscale 3,nop,nop,timestamp 47429206 0,sackOK,eol>
11:46:15.339426 IP y.y.y.y.http > x.x.x.x.local.63182: S 1014263802:1014263802(0) ack 4092983780 win 5792 <mss 1380,sackOK,timestamp 663357440 47429206,nop,wscale 7>
11:46:15.339497 IP x.x.x.x.local.63182 > y.y.y.y.http: . ack 1 win 65535 <nop,nop,timestamp 47429207 663357440>
11:46:15.339813 IP x.x.x.x.local.63182 > y.y.y.y.http: P 1:678(677) ack 1 win 65535 <nop,nop,timestamp 47429207 663357440>
11:46:15.484783 IP y.y.y.y.http > x.x.x.x.local.63182: . ack 678 win 56 <nop,nop,timestamp 663357478 47429207>
11:46:15.484855 IP x.x.x.x.local.63182 > y.y.y.y.http: . 678:2046(1368) ack 1 win 65535 <nop,nop,timestamp 47429209 663357478>
11:46:15.639983 IP y.y.y.y.http > x.x.x.x.local.63182: . ack 2046 win 79 <nop,nop,timestamp 663357517 47429209>
11:46:15.640083 IP x.x.x.x.local.63182 > y.y.y.y.http: . 2046:3414(1368) ack 1 win 65535 <nop,nop,timestamp 47429210 663357517>
11:46:15.640129 IP x.x.x.x.local.63182 > y.y.y.y.http: . 3414:4782(1368) ack 1 win 65535 <nop,nop,timestamp 47429210 663357517>
11:46:15.799374 IP y.y.y.y.http > x.x.x.x.local.63182: . ack 3414 win 102 <nop,nop,timestamp 663357555 47429210>
11:46:15.799473 IP x.x.x.x.local.63182 > y.y.y.y.http: . 4782:6150(1368) ack 1 win 65535 <nop,nop,timestamp 47429212 663357555>
11:46:15.799511 IP x.x.x.x.local.63182 > y.y.y.y.http: P 6150:6209(59) ack 1 win 65535 <nop,nop,timestamp 47429212 663357555>
11:46:15.819146 IP y.y.y.y.http > x.x.x.x.local.63182: . ack 4782 win 124 <nop,nop,timestamp 663357560

My results:

11:46:15.272449 IP z.z.z.z.63182 > y.y.y.y.www: S 1042651241:1042651241(0) win 65535 <mss 1380,nop,wscale 3,nop,nop,timestamp 47429206 0,sackOK,eol>
11:46:15.272466 IP y.y.y.y.www > z.z.z.z.63182: S 37547187:37547187(0) ack 1042651242 win 5792 <mss 1460,sackOK,timestamp 663357440 47429206,nop,wscale 7>
11:46:15.408746 IP z.z.z.z.63182 > y.y.y.y.www: . ack 1 win 65535 <nop,nop,timestamp 47429207 663357440>
11:46:15.421489 IP z.z.z.z.63182 > y.y.y.y.www: P 1:678(677) ack 1 win 65535 <nop,nop,timestamp 47429207 663357440>
11:46:15.421505 IP y.y.y.y.www > z.z.z.z.63182: . ack 678 win 56 <nop,nop,timestamp 663357478 47429207>
11:46:15.578903 IP z.z.z.z.63182 > y.y.y.y.www: . 678:2046(1368) ack 1 win 65535 <nop,nop,timestamp 47429209 663357478>
11:46:15.578914 IP y.y.y.y.www > z.z.z.z.63182: . ack 2046 win 79 <nop,nop,timestamp 663357517 47429209>
11:46:15.731438 IP z.z.z.z.63182 > y.y.y.y.www: . 2046:3414(1368) ack 1 win 65535 <nop,nop,timestamp 47429210 663357517>
11:46:15.731444 IP y.y.y.y.www > z.z.z.z.63182: . ack 3414 win 102 <nop,nop,timestamp 663357555 47429210>
11:46:15.751551 IP z.z.z.z.63182 > y.y.y.y.www: . 3414:4782(1368) ack 1 win 65535 <nop,nop,timestamp 47429210 663357517>
11:46:15.751557 IP y.y.y.y.www > z.z.z.z.63182: . ack 4782 win 124 <nop,nop,timestamp 663357560 47429210>
11:46:15.892222 IP z.z.z.z.63182 > y.y.y.y.www: . 4782:6150(1368) ack 1 win 65535 <nop,nop,timestamp 47429212 663357555>

What a shocker--fairly damning proof of another ISP running some sort of ridiculous filtering/inspection/shaping/proxying software. Except this stuff isn't even good enough to be invisible: it couldn't even put the packets back together properly. My day is basically spent trying to define <=> across the set of priorities "A", "Top", "1", "A1", "1A", "First", "Primary", and "Pretty Important", and this stuff makes me exceedingly frustrated. Where can I send an invoice for my time to companies that make me waste 2 hours to discover that we're not even dealing with IP anymore, but some embraced and extended version of it?
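
The handshake lines in this second pair of captures show the same kind of rewriting. As a rough sketch (the `syn_fields` helper and its regular expression are illustrative, and assume the tcpdump format shown above), this pulls the initial sequence number and the advertised MSS out of the developer's SYN as it left him and as it reached me:

```python
import re

# Captures the initial sequence number and the MSS option from a SYN line
# in the old tcpdump format, e.g. ": S 123:123(0) win 65535 <mss 1460,...>".
SYN = re.compile(r": S (\d+):\d+\(0\).*?<mss (\d+)")


def syn_fields(line):
    """Return the initial sequence number and MSS advertised in a SYN line."""
    match = SYN.search(line)
    if match is None:
        raise ValueError("not a SYN line with an MSS option")
    return {"seq": int(match.group(1)), "mss": int(match.group(2))}


# The same SYN, copied from the two captures above.
AS_SENT = (
    "11:46:15.199847 IP x.x.x.x.local.63182 > y.y.y.y.http: "
    "S 4092983779:4092983779(0) win 65535 "
    "<mss 1460,nop,wscale 3,nop,nop,timestamp 47429206 0,sackOK,eol>"
)
AS_RECEIVED = (
    "11:46:15.272449 IP z.z.z.z.63182 > y.y.y.y.www: "
    "S 1042651241:1042651241(0) win 65535 "
    "<mss 1380,nop,wscale 3,nop,nop,timestamp 47429206 0,sackOK,eol>"
)

sent_syn = syn_fields(AS_SENT)
received_syn = syn_fields(AS_RECEIVED)

# Report every field that changed between the two ends of the connection.
for field in ("seq", "mss"):
    if sent_syn[field] != received_syn[field]:
        print(f"{field}: sent {sent_syn[field]}, received {received_syn[field]}")
```

The SYN leaves advertising an MSS of 1460 and arrives advertising 1380, with a different initial sequence number. At least in these captures, something in between is rewriting TCP headers rather than just forwarding packets.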

I hope that one day it will be profitable enough to run a company that actually provides internet service instead of an ad pipeline. I know I'd pay more. In the meantime, there's nothing to do but grind our teeth until ad replacement by ISPs makes the internet no longer profitable.

The Drupal-Based Toolchain

I made a choice to go into system administration about a year ago, after a hiatus from web development. I'm getting back into coding again in a hurry, and rebuilding my toolchain is a pretty painful process. The biggest part of that toolchain has been Drupal 6, and it's both a wonderful revelation and a hugely frustrating experience.

This will make more sense with some background. I stopped doing web development in '03, after using the Perl on Rails framework, which is exactly what we did not call the combination of Perl 5, Class::DBI, Template Toolkit, and CGI::Application in those days, long before Rails was a blip on the radar. I got out of it because I saw I wasn't capable of doing bigger projects alone, not with those tools, and I supported myself by making minor updates to other applications and the occasional small-scale bookkeeping app. When I came back to looking for full-time work, I went for system administration, which involves more interaction with people, or at least it used to: coding is much more social than it used to be.

I lucked into a job running one of the most complicated Drupal installations out there, with some 600k lines of code across 130 modules. I work with some terribly smart people, and we're doing some very cool stuff. It took me quite some time to start getting into the Drupal stuff itself, but once I started to figure it all out, it clicked pretty well. It's a welcome kind of uncomfortable not to be the smartest person in the room.

My previous toolchain consisted of Perl's copious libraries, the Windows ports of MySQL and Apache, Cygwin, and vi. In hindsight, this was childish, pathetic, comical, the kind of toolchain that wears clown shoes. It still let me do an awful lot, compared to the scripts I was writing in '99 that amounted to little more than CGI to SQL wrappers, but it was clear to me how limited those tools were.

What a difference a few years make: now I'm on a MacBook Pro with a ton of handy MacPorts packages, a native set of system tools as good as or better than the Linux equivalents I'm used to, and tons of useful apps, all set at the low, low price of $20-$40 apiece, because the cult of Mac is a true religion, complete with tithes. I'm finally getting practice with source control in the kind of complicated environment where it matters, and the devs around me know a ton of nifty tricks and tools. All of that stuff is a vast improvement, but the single biggest change to the toolchain is Drupal.

Before I get into what I like and don't like, there's a question about how one should see Drupal: as part of a toolchain, or as an end-user piece of software. It's designed, from the ground up, to let people start with nothing and end with a website, so perhaps considering it just a link in a chain is inappropriate, but I don't think so. No website, no matter how little code is involved, should be considered to exist in a vacuum; that's not how the web works today. Each site should be considered a node on a graph in addition to a site in and of itself, just like a library written for a project should be considered in the scope of reusability. In this respect, I'm weighing Drupal against web frameworks like Ruby on Rails or Django as opposed to other CMSs, such as that recurring villain of the Drupal graphic novel, WordPress.

At any rate, wow oh wow, what an improvement! I really used to write menu templates that had to check which page was currently being loaded and give that option 'current page' CSS? Really? Write SQL for anything that required a join and an order or limit at the same time? Really? Write out form HTML? Hand-write JavaScript for the most basic form validation? Write templates at all? Drupal lets me worry about coding and not display, and that makes me actually interested in writing software again. It's an amazing tool, and as the browser becomes a platform, it's great to have so many of the basic tools in the web designer's toolkit handled better than most desktop client frameworks handle the desktop.

So the move to Drupal really is a groundbreaker for me, and now I'm busy with web development and writing silly blogs when I get home from busy days of the kinds of things sysadmins fill their days with. Unfortunately, I still see a lot of problems. The gains make it all very worthwhile, but some parts of Drupal are exceedingly frustrating when it's seen as a strong link in a toolchain and not an end-all, be-all website.

The first place this toolchain could be improved is an automated module install system. There are about 40 competing solutions, but on the whole, it's ridiculous that I can't tell my website to install the Google Analytics module into itself. The Drupal modules site doesn't really help a newbie find the kinds of modules that every site should have, things like Pathauto and Google Analytics. It's a library distribution system that's technologically behind PEAR, RubyGems, easy_install, and even CPAN, which is about a decade old. Hopefully one of those competing solutions emerges as a workable base soon.

I find that some parts of the system are configured in strange places. Blocks, for example, are edited in a special area for blocks. That's appropriate. But the pages a block appears on, including page-by-page exceptions, are edited along with the content of the block. Meanwhile, a node's own page is where its menus, URL aliases, and publishing options are selected, as inline options shown if the viewer is an administrator. While the ability to make a block display based on some PHP code is perfect sometimes, other times it's more maintainable for a page to control what blocks it has, and not the other way around. Besides, it's rarely appropriate to give a non-administrator PHP rights. Breaking up the configuration of a page like this makes it harder to reuse any special configuration or code that particular page might have. I think a few changes of this kind might go a long way towards maintainability.

Hand in hand with maintainability is reusability. I have a lot of experience with this as a Drupal sysadmin. All of those great CCK types and views and whatnot require significant overhead to export and import properly, and keeping them in version control is even more difficult. It's a huge problem when you manage as many sites as we do, and lots of people are trying to solve it: there's a lot of good work going on in the change management group, tools like CoCKtail solve the problem by letting CCK types be a simple bit of text, and the ephemeral autopilot threatened to put entire databases under version control to deal with the problem.

That problem looks like it's being solved: good. It's a shame that so much work is being spent on the simple problem that Drupal litters the definition of a site across code, a database, and a filesystem, and that it does it in a more or less unrepeatable way. Point to a particular file in a site's files directory: is it in use, or is it some poor lost inode, left adrift from 2 upgrades ago? Best just to leave it be.
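
To make the lost-file question concrete, here is a small sketch in Python that lists files under a directory that don't appear in a set of known paths. Everything here is hypothetical: the `unreferenced_files` helper, the file names, and the `referenced` set, which stands in for whatever list of paths you can pull out of the site's database.

```python
import os
import tempfile


def unreferenced_files(files_dir, referenced):
    """Return sorted relative paths under files_dir that are not in referenced."""
    orphans = []
    for root, _dirs, names in os.walk(files_dir):
        for name in names:
            rel = os.path.relpath(os.path.join(root, name), files_dir)
            rel = rel.replace(os.sep, "/")  # compare with forward slashes
            if rel not in referenced:
                orphans.append(rel)
    return sorted(orphans)


# Demo on a throwaway directory; the file names are made up.
with tempfile.TemporaryDirectory() as files_dir:
    os.makedirs(os.path.join(files_dir, "old"))
    for rel in ("logo.png", "old/banner.png"):
        with open(os.path.join(files_dir, rel), "w") as handle:
            handle.write("placeholder")
    orphans = unreferenced_files(files_dir, referenced={"logo.png"})

print(orphans)
```

A file that shows up in this list is only a candidate: it may still be linked from a theme, a node body, or custom code, which is exactly why it's tempting to leave it be.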

None of these problems are terribly hard to solve for D7: it's just a change in point of view. And really, at the end of the day, that's a long blog post for a bunch of problems that aren't going to make me pick something else, so let's not assume anyone's about to abandon all of those useful contributed modules in order to do something drastic. Let's just keep an open mind.
