Hi there! You've reached the homepage of Ben Lavender! Ben is a computery person who partakes in many kinds of nerd tomfoolery, but mostly programming! You've found his home on the web, with the caveat that a lot of his day-to-day stuff ends up on social networks, so this site makes him seem pretty stodgy.
Enjoy your stay.


Setting up your system for file conversions with File Framework

An important requirement of the platform for SPAWAR (for whom my employer, Openband, works) is full-featured file handling. Our solution is Miglius' File Framework, an exceedingly powerful module for handling files on a Drupal 6 site. It requires Arto's RDF Framework, which makes it easy for other modules to interact with it, and it has a ton of other features, but the one I'm writing about today is file conversion. Out of the box, FF supports logic for tons of conversion paths, so anything your users upload is downloadable as a preview, as a full file, playable in the browser, viewable as HTML, whatever. Importantly, it uses OpenOffice to do all of those nifty office format conversions: viewing Word documents in the browser, inline, is exceedingly useful.

It'd be silly to write that much conversion code yourself, and naturally we didn't want to reinvent the wheel. But that means there's a ton of software to install to make all of this work, from PHP 5.2 to JOD Converter, and that's what this post is about.

I will quickly get out of the way that this is not something intended for shared hosting. Good shared hosting will actually have a lot of this stuff installed already, and you'll just need to find the path, but documenting building all of this in non-privileged space is not what I'm after.

Also of note, all of this was done on CentOS 5. I think it would be easier on Debian or Ubuntu, as some of these software bits needed to be built manually because of CentOS's occasionally dated packages. A great example is RHEL5/CentOS's inexplicable PHP 5.1.6. The resource framework requires PHP 5.2, and several months ago I rolled up a PHP 5.2.4 package based on the RHEL package, dropping the patches that were made obsolete by fixes between 5.1.6 and 5.2.4 and keeping the rest. It works exactly like the RHEL/CentOS one does, except the version doesn't suck. If you're using PHP on CentOS or RHEL, you may want to download this PHP RPM.
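Before going further it is worth confirming which PHP you actually have. Here is a minimal sketch of a dotted-version comparison in plain sh; the helper name version_ge is made up for this post, and it only handles purely numeric components (so "5.2.4", not "5.2.4-1").

```shell
# version_ge A B: succeed when dotted version A is at least B.
# Hypothetical helper: compares up to three numeric components,
# treating missing components as 0.
version_ge() {
    a=$1
    b=$2
    old_ifs=$IFS
    IFS=.
    set -- $a
    a1=${1:-0}; a2=${2:-0}; a3=${3:-0}
    set -- $b
    b1=${1:-0}; b2=${2:-0}; b3=${3:-0}
    IFS=$old_ifs
    [ "$a1" -gt "$b1" ] && return 0
    [ "$a1" -lt "$b1" ] && return 1
    [ "$a2" -gt "$b2" ] && return 0
    [ "$a2" -lt "$b2" ] && return 1
    [ "$a3" -ge "$b3" ]
}

# Example use (needs the php command-line binary):
#   version_ge "$(php -r 'echo PHP_VERSION;')" 5.2 || echo "PHP is too old"
```

On a stock CentOS 5 box with PHP 5.1.6 the example would print the warning; with the 5.2.4 package it stays quiet.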

First of all, you'll need the PECL Fileinfo package. This is a normal PECL installation:

# pecl install Fileinfo
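PECL builds the extension but does not always switch it on. A quick sketch for checking that PHP actually sees it; module_listed is a made-up helper name, and the /etc/php.d path in the comment is an assumption based on how the RHEL/CentOS PHP packages lay out their config.

```shell
# module_listed NAME: read a module list on stdin (one name per line,
# as printed by `php -m`) and succeed if NAME is in it, ignoring case.
module_listed() {
    grep -i -q -x "$1"
}

# Example use, as root. The ini path is an assumption for CentOS:
#   php -m | module_listed fileinfo \
#       || echo 'extension=fileinfo.so' > /etc/php.d/fileinfo.ini
```

Restart the web server afterwards so mod_php picks up the new extension.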

Packages should exist for ffmpeg, catdoc, swftools, unrtf, and pdftotext (shipped in poppler-utils on CentOS, and originally part of the Xpdf project):

 
# yum install -y swftools unrtf poppler-utils catdoc 
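Once the packages are in, a small loop can confirm the binaries are really on the path. This is only a sketch: missing_tools is a hypothetical helper, and the command names in the example are the ones I would expect from these packages (pdf2swf comes from swftools).

```shell
# missing_tools NAME...: print each command name that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Example use:
#   missing_tools ffmpeg catdoc pdf2swf unrtf pdftotext
```

No output means everything was found; anything printed still needs installing or a path fix.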

CentOS's Ghostscript package is simply too old, but briefly installing it will take care of the numerous dependencies. Snag a copy of the tarball, unpack it, and install like so:

# yum -y install ghostscript
# rpm -e ghostscript
# cd ghostscript-8.62
# ./configure --prefix=/opt/ghostscript --exec-prefix=/opt/ghostscript
# make
# make install

Note that this will require some path changes to the File Framework conversion configuration settings, unless you add Ghostscript to your web server's path.
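If you go the path route, something like the following sketch keeps the change idempotent. The helper name path_with is invented for illustration, and /opt/ghostscript/bin is where the binaries should land given the prefix used above. Where the web server reads its environment from is an assumption on my part: on CentOS the httpd init script sources /etc/sysconfig/httpd, so an exported PATH there should reach Apache.

```shell
# path_with DIR [PATH]: print PATH with DIR prepended, unless DIR is
# already one of its entries. Defaults to the current $PATH.
path_with() {
    dir=$1
    current=${2-$PATH}
    case ":$current:" in
        *":$dir:"*) echo "$current" ;;
        *) echo "$dir:$current" ;;
    esac
}

# Example use, e.g. in /etc/sysconfig/httpd (assumed location):
#   PATH=$(path_with /opt/ghostscript/bin)
#   export PATH
```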

Probably the most troublesome bit is OpenOffice. First you'll need a JRE, and a recent one; if your OS doesn't come with a 1.6 package it's easily obtained from Sun. Most importantly, you need an i386 package: OpenOffice is buggy on x86_64.
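It is easy to end up with the 64-bit JRE by accident. One way to check, sketched here, is to look at what the file utility says about the java binary. The helper name is_i386_elf is made up, and it simply pattern-matches the description that file prints for a 32-bit x86 ELF binary.

```shell
# is_i386_elf DESCRIPTION: succeed if a line of `file` output describes
# a 32-bit x86 ELF binary.
is_i386_elf() {
    case $1 in
        *"ELF 32-bit"*"Intel 80386"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example use (-L follows the symlinks a JRE install tends to create):
#   is_i386_elf "$(file -L "$(command -v java)")" || echo "not an i386 JRE"
```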

OpenOffice itself is probably the roughest part. The packages aren't really built with a 'headless converter' in mind, and CentOS isn't helping. As I keep my servers as slim as possible, I needed to install a ton of X libraries to get things going. In addition, OpenOffice requires a Java tzdata package that isn't available for CentOS, but the Fedora Core 8 version works fine.

Then install the RPMs from OpenOffice.org. The only packages that should fail to install are the GNOME/KDE integration bits.

To use that OpenOffice goodness, you'll need to run it like a daemon. I've slightly modified the init script found here for our use, so that it can run as an unprivileged user, and in particular as a user that the web server can work with. Copy and paste from below or wget this link:

#!/bin/bash
#
# chkconfig: 35 96 4
# description: Open Office
#

#Source function library.
. /etc/init.d/functions

export OOUSER=apache
export DISPLAY=:0.0
export HOME=/home/${OOUSER}
export PATH=$PATH
export LANG=en_US.UTF-8

start() {
    echo -n "Starting OpenOffice service: "
    sudo -u $OOUSER env HOME=$HOME /opt/openoffice.org2.4/program/soffice -headless -accept="socket,port=8100;urp" -nofirststartwizard -display $DISPLAY &
    ### touch the lock file ###
    touch /var/lock/subsys/soffice
    success $"OpenOffice startup"
    echo
}

stop() {
    echo -n "Stopping soffice: "
    killproc soffice
    ### Remove the lock file ###
    rm -f /var/lock/subsys/soffice
    echo
}

case "$1" in
    start)
        start
        ;;
    stop)
        stop
        ;;
    status)
        status soffice
        ;;
    restart|reload|condrestart)
        stop
        start
        ;;
    *)
        echo $"Usage: $0 {start|stop|restart|reload|status}"
        exit 1
        ;;
esac

exit 0
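With the script saved, it still has to be registered and started, and it is worth checking that something is really listening on port 8100. A rough sketch follows. The file name /etc/init.d/soffice is my assumption (it matches the lock file name used in the script), and port_listening is a made-up helper that just scans netstat output for a local address ending in the given port.

```shell
# port_listening PORT: read `netstat -ltn` output on stdin and succeed
# if any local address (fourth column) ends in :PORT.
port_listening() {
    awk -v p=":$1" '$4 ~ p"$" { found = 1 } END { exit !found }'
}

# Example use, as root, assuming the script lives at /etc/init.d/soffice:
#   chmod 755 /etc/init.d/soffice
#   chkconfig --add soffice
#   service soffice start
#   netstat -ltn | port_listening 8100 || echo "soffice is not listening"
```

OpenOffice can take a few seconds to open the socket, so give it a moment before running the check.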

Lastly, you need a command line tool to interface with OpenOffice. JOD Converter handles that, and that's what we use. There's also a Python version if that floats your boat. There's no RPM for it, but it's just a .jar file, so snag a tarball from the downloads area. I put it in /opt/jodconverter. Like Ghostscript, this will require editing File Framework's conversion settings unless you put it in the web server's path.
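To sanity-check the whole chain by hand, outside of Drupal, a tiny wrapper like this sketch can help. The jar file name and the lib/ subdirectory are assumptions (adjust them to whatever the tarball you downloaded contains), and convert_doc is a hypothetical helper. It relies on the JOD Converter command line taking an input file and an output file and talking to OpenOffice on port 8100, which is the port the init script above opens.

```shell
# Assumed location of the CLI jar; override JOD_JAR to match your install.
JOD_JAR=${JOD_JAR:-/opt/jodconverter/lib/jodconverter-cli-2.2.1.jar}

# convert_doc INPUT OUTPUT: convert INPUT to the format implied by
# OUTPUT's extension, failing early if the jar or the input is missing.
convert_doc() {
    [ -f "$JOD_JAR" ] || { echo "missing jar: $JOD_JAR" >&2; return 1; }
    [ -f "$1" ] || { echo "no such input: $1" >&2; return 1; }
    java -jar "$JOD_JAR" "$1" "$2"
}

# Example use:
#   convert_doc report.doc report.pdf
```

Run it as the same user the web server uses, since that is the account File Framework will be converting files as.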

Enjoy your multiply-converted file goodness!