Finding duplicate files with Go

For some time now, I’ve been messing around with the Go programming language. I like the idea of a systems language with a modern take on the standard library (built-in support for JSON and HTTP calls, and built-in vectors called slices), concurrency, easy use of third-party dependencies and fast, clean code.

As a fun experiment to see how easy it is to traverse a filesystem, hash files, check file sizes and parse arguments, I’ve implemented a small and simple tool for finding duplicate files in a list of directories (supplied as arguments to the application).

It goes something like this:

  1. Gather all filenames and sizes in a map, where the key is the size of the files and the value is a slice of filenames as strings.
  2. For all the sizes with multiple filenames, hash up to the first 1024 bytes of the files and add to a similar map.
  3. For all hashes with multiple filenames, print them as duplicates and let the user choose what to do.
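
The three steps above can be sketched roughly as follows. This is a Python illustration rather than the actual Go source; the function names and the choice of SHA-1 are assumptions made for the example, since the hash used is not stated above.

```python
import hashlib
import os
import sys
from collections import defaultdict


def group_by_size(roots):
    """Step 1: map file size -> list of paths with that size."""
    by_size = defaultdict(list)
    for root in roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                # Skip symlinks and anything that is not a regular file.
                if os.path.isfile(path) and not os.path.islink(path):
                    by_size[os.path.getsize(path)].append(path)
    return by_size


def group_by_prefix_hash(paths, prefix_bytes=1024):
    """Step 2: map hash of the first prefix_bytes -> list of paths."""
    by_hash = defaultdict(list)
    for path in paths:
        with open(path, "rb") as handle:
            digest = hashlib.sha1(handle.read(prefix_bytes)).hexdigest()
        by_hash[digest].append(path)
    return by_hash


def find_duplicates(roots):
    """Step 3: return groups of paths sharing both size and prefix hash."""
    groups = []
    for same_size in group_by_size(roots).values():
        if len(same_size) < 2:
            continue
        for same_hash in group_by_prefix_hash(same_size).values():
            if len(same_hash) > 1:
                groups.append(sorted(same_hash))
    return groups


if __name__ == "__main__":
    for group in find_duplicates(sys.argv[1:]):
        print("Possible duplicates:")
        for path in group:
            print("  " + path)
```

Because only the first 1024 bytes are hashed, the groups are candidates rather than proven duplicates: two files of the same size that differ only after the first kilobyte end up in the same group.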

The code (in the newest version) can be found here: https://bitbucket.org/dennishedegaard/duplifinder/src/master/duplifinder.go

In my day job I usually spend my time programming Python. Python is a nice language, but no language is perfect. It is extremely nice for quickly writing working, very readable code, and in general for doing things at a higher level of abstraction. Go, on the other hand, seems to put a much larger emphasis on clean code, being nearer the metal and having more control.

The difference is clear when building Go code: it will not compile if you have declared a variable that is not used anywhere. The same goes for an unused import or a type mismatch. Python, on the other hand, will run pretty much anything that parses and stops only when it reaches the first bad line of code, which makes it hard to know up front whether the code contains errors such as misspelled names or wrong types. Python has third-party tools for checking these things (pyflakes, pylint, pep8 etc.), but they do not catch nearly as much as the built-in checks in the Go compiler.

Go is a statically typed language, but it infers types like many other modern languages. Python is dynamically typed, and it can be a challenge to figure out what a variable is pointing to in a large codebase.

Another difference is that Go has no exceptions (it has panics, but they are used differently, more like serious crashes). Instead, most functions that can fail return two results (like when you return a tuple in Python): the result and an error. If an error occurred, the result is usually nil while the error is an error object, and vice versa.

When you write a Go program you usually run it by calling “go run <file>”, which builds and runs the program. When you’re ready to deploy you can simply call “go build <file>” and a binary is built. I have tried moving binaries to systems without Go installed, and they still work.

I will probably keep messing around with Go, especially since it’s so different from Python. I’ve always had a weak spot for systems languages like C and C++, but hated the somewhat small standard libraries and APIs, and the constant checking for bad pointers and memory leaks with valgrind. Go seems to have the perfect blend of simplicity, power and expressiveness without the history that shaped older languages.

Go also features APIs for doing web development. For now I’ve done most of that using the web.go framework, which seems a bit far away from my usual framework (Django).

I will probably keep on coding Go for various projects in the future. It’s a nice language with some new and interesting ideas. I tried Rust as well, but the unstable API and lack of documentation are still keeping me away for now.

 

Django, Memcached and the EZTV twitter feed

Back in the day I wrote an HTML page that used JavaScript, AJAX and JSONP to parse the eztv-it Twitter timeline into something useful (it can be found here: http://eztv-mirror.appspot.com/). Unfortunately it does not work anymore.

As a fun project I considered implementing a replacement in Django, using the Django caching framework and some “nice to know” libraries (e.g. requests).

The main problem with the old version is that it does not receive a response from the Twitter REST API. When you hit the URL directly, however, you do get a response. I suspect the problem has to do with the callback used to wrap the content on the server side not working correctly (not wrapping the JSON response in the callback function or similar).

Using the Python requests library I was able to get the data I needed. The data can be found at the following URL:
https://api.twitter.com/1/statuses/user_timeline.json?screen_name=eztv_it

Since this project is about familiarising myself with a lot of the “nice to know” libraries in common use, here is the list (from requirements.txt):

  • Django==1.5.1 – Obviously :)
  • python-dateutil==2.1 – Dateutil has a nice parser method that attempts to parse the date using common formats, so I most likely won’t notice if Twitter changes their datetime format.
  • pytz==2013b – required by python-dateutil, but Django uses it internally as well (if available).
  • requests==1.2.0 – used to get the data from Twitter, it’s a wrapper around the urllib/urllib2/httplib mess.
  • python-memcached==1.51 – One of the two popular memcached bindings for Python.
  • django-memcached==0.1.2 – Django bindings to the python memcached module (above).

Most of the above is for convenience: fewer lines of code are required, which makes the code easier to read and reduces the chance of bugs.

In Django you define the caching used in the settings module, using something similar to this (the example is a local memcached, over TCP):

CACHES = {
    'default': {
        'BACKEND': 'django.core.cache.backends.memcached.MemcachedCache',
        'LOCATION': '127.0.0.1:11211',
    }
}

The backend can be switched to anything else that implements the Django cache interface without much fuss. You can also use multiple caching backends simultaneously.

The Django caching framework is extremely easy to use. Here’s my take on it (from app/eztv.py):

from django.core.cache import cache
...
# Key of the cache.
CACHE_ENTRY_NAME = 'eztv_cache'
# Timeout of the cache (in seconds)
CACHE_ENTRY_TIMEOUT = 600
...
def update_cache():
    ...
    data = list(yield_data())

    cache.set(CACHE_ENTRY_NAME, data, CACHE_ENTRY_TIMEOUT)
    return data

def get_cache():
    ...
    _cache = cache.get(CACHE_ENTRY_NAME)

    # If cache is empty or outdated, update it.
    if _cache is None:
        _cache = update_cache()

    return _cache

As can be seen, using the Django cache framework takes a simple import (after defining the backend in settings), and it works along the lines of a key/value store where the elements can have a timeout.

The backend can even be memory-only (normally only for development) or simply use files on the filesystem (one per key).

One of the caveats to the method I use is of course the problem when the cache is outdated and needs to be refreshed, since this takes several seconds. One way to solve this is by using a cronjob for updating the cache. This however brings more complexity and more dependencies to the system as a whole.

It could be interesting to try sending 100 requests in quick succession right after the cache has become outdated or gone missing. I do not know if this is a problematic case that Django solves for me by keeping track of get/set calls to the cache per request, or if it’s a problem I have to solve myself. It is, however, easy to test.
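
Such a test could look roughly like the sketch below. It does not touch Django at all: a plain dict stands in for the cache backend, slow_refresh is a hypothetical stand-in for the Twitter fetch, and threads stand in for concurrent requests. The naive getter mirrors the get_cache logic above, while the locked variant shows one possible way to make sure only one caller refreshes.

```python
import threading
import time

_cache = {}                       # stand-in for the Django cache backend
_refresh_lock = threading.Lock()  # guards the refresh in the locked variant
_count_lock = threading.Lock()    # guards the counter below
refresh_calls = 0                 # how many times the slow refresh ran


def slow_refresh():
    """Pretend to fetch the timeline (hypothetical, shortened to 0.2 s)."""
    global refresh_calls
    with _count_lock:
        refresh_calls += 1
    time.sleep(0.2)
    return ["tweet 1", "tweet 2"]


def get_cache_naive():
    """Every caller that sees a miss triggers its own refresh."""
    data = _cache.get("eztv_cache")
    if data is None:
        data = slow_refresh()
        _cache["eztv_cache"] = data
    return data


def get_cache_locked():
    """Only the first caller refreshes; the others wait and reuse the result."""
    data = _cache.get("eztv_cache")
    if data is None:
        with _refresh_lock:
            data = _cache.get("eztv_cache")  # re-check once the lock is held
            if data is None:
                data = slow_refresh()
                _cache["eztv_cache"] = data
    return data


def simulate(getter, requests=100):
    """Empty the cache, fire concurrent requests, return the refresh count."""
    global refresh_calls
    _cache.clear()
    refresh_calls = 0
    threads = [threading.Thread(target=getter) for _ in range(requests)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return refresh_calls
```

Calling simulate(get_cache_naive) and simulate(get_cache_locked) shows how many refreshes each variant triggers. Note that a threading.Lock only helps within a single process; with several worker processes the lock would have to live somewhere shared, for example in the cache itself.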

Git repository can be found here: https://bitbucket.org/dennishedegaard/eztv/
A running site can be found here: http://ez.dhedegaard.dk/

Raspberry Pi

I finally got my Raspberry Pi in the mail. It’s a small, extremely cheap ARM-based computer, and I see it as a fun gadget to mess around with. The official operating system for it is Raspbian, a modified version of the Debian GNU/Linux operating system, which means the software is something I already know very well.

Here’s a pic of the system up and running:

IMG_20130402_182958

And here’s a screenshot of the desktop running. Raspbian uses a slightly modified LXDE desktop:

raspberrypi

One of the immediate annoyances is the need for 700 mA at 5 volts, since the maximum current from a USB 2 port is 500 mA at 5 volts. 700 mA at 5 volts means the peak power draw is around 3.5 watts. When the system is idle it seems to use around 2 watts. This is very impressive for a computer running a modern operating system.
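
The numbers above follow from P = V × I. A tiny sketch of the arithmetic (the function name is made up for the example):

```python
def power_watts(volts, milliamps):
    """Electrical power P = V * I, with the current given in milliamps."""
    return volts * milliamps / 1000.0


pi_peak = power_watts(5, 700)     # what the Raspberry Pi asks for
usb2_limit = power_watts(5, 500)  # what a USB 2 port is specified to supply

# The gap is the power a plain USB 2 port cannot deliver.
print(pi_peak, usb2_limit, pi_peak - usb2_limit)
```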

zram and the memory-problem with virtualization

I like to run a lot of VMs. Whenever I start work on a non-trivial project I usually make a VM to isolate the system a bit, which means I end up running a lot of VMs from time to time.

One of the problems with virtualization is the increase in memory usage. One of the ways I’ve tried to counter this is by using KSM (Kernel Samepage Merging), which merges pages in memory containing the same data. My server has 8 GB of memory, and this saves me about 1 GB on average.

A friend of mine keeps talking about how awesome zram is, so I gave it a shot. What it does is allocate memory to a compressed block device, which can then be used for swap. Swapping to and from compressed memory is super fast compared to traditional swapping to disk. The obvious downside of zram is the processing overhead caused by the constant compression and decompression of pages going to and from the swap.

Here’s a short explanation of the things I did to enable zram.

Upgrading Debian

I like to run stable software (especially on the system my hypervisor (KVM) lives on), and it’s hard to get more stable than Debian stable (CentOS anyone?). This means running a 2.6.32 kernel (in the case of squeeze/6.0). Since wheezy (or 7.0) is currently in RC1, I decided to upgrade. Needless to say this went smoothly, and I now run a 3.2 kernel.

Modprobing the module

On Debian, all I need to do to load the module at boot is to add “zram” to /etc/modules. One parameter you’d want to give the module when loading it is zram_num_devices, which tells zram how many devices you want. Usually you want as many zram devices as there are CPUs in the system.

On Debian this is done by making a file in /etc/modprobe.d and entering something like (in my case 2 cores):

options zram zram_num_devices=2

Initializing the SWAP on boot

Since the content of DRAM (Dynamic RAM) is lost when the modules are unpowered, you need to create new swap every time you boot. There are lots of nice init scripts out there to do this for you, but I like to do it myself; my solution (albeit primitive) is to put the details in /etc/rc.local. First I tell the zram devices how many bytes I want in each, then I create the swap areas, and then I enable them. Details below:

# zram swap
# expects zram modprobed with zram_num_devices=2
# allocates 4gb mem to zram, in 2 separate swap partitions.
echo 2147483648 > /sys/block/zram0/disksize
echo 2147483648 > /sys/block/zram1/disksize
mkswap /dev/zram0
mkswap /dev/zram1
swapon -p 100 /dev/zram0
swapon -p 100 /dev/zram1

The reason for using 4 GB of my memory as zram is that my machine has a total of 8 GB of memory. Some scripts use 100% of the memory as zram.
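
The disksize value used in the script follows from that split. A small sketch of the arithmetic, with a made-up helper name and the fraction as a parameter:

```python
GIB = 1024 ** 3  # bytes in one GiB


def zram_disksize(total_ram_bytes, fraction, devices):
    """Bytes to write to each /sys/block/zramN/disksize."""
    return int(total_ram_bytes * fraction) // devices


# 8 GB of RAM, half of it as zram, split over 2 devices.
print(zram_disksize(8 * GIB, 0.5, 2))
```

With these inputs the result is 2147483648, the value echoed into each disksize file above.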

Some statistics

I’ve located a script on the old zram site that prints some statistics about the zram devices at runtime. It can be found here:

http://compcache.googlecode.com/git/sub-projects/scripts/zram_stats

As always YMMV, and I’ll be tweaking my setup over time as I run into new bottlenecks.

Making jpg transparent with PIL

A fun experiment: how hard is it to convert a JPG (or anything else) to a transparent PNG using PIL (Python Imaging Library)?

Not very hard, it seems. The source can be found here:

https://bitbucket.org/dennishedegaard/transparentpng/src/194d8e5f1dad1491259ad697e9085f3726445b9b/transparentpng.py?at=master

Later commits add a web interface for GAE, which can be found here: http://transparentpng.appspot.com/

Long-polling chat application in Django

Back when I was a student I messed around with websockets (back then it was Grizzly on GlassFish). Nowadays most of my development is done in Python. The nice thing about websockets is that they work like a TCP socket where both parties can send data, but initiated over HTTP. This allows the server to send data to the client without the client being the active party. Long-polling gets a similar effect using only plain HTTP requests.

The basic use case for long-polling is described below.

  1. The client initiates an AJAX request to the server.
  2. The server checks whether it has something to return. If it has, it returns it and we go back to step 1.
  3. If the server has nothing to return, it waits for a period of time (in my case 20 seconds), checking now and again whether it has something to return. If it finds something, it returns it and we go back to step 1.
  4. If the 20 seconds pass and the server still does not have anything to send, the connection is closed (i.e. the server returns 200 OK or similar), and the client makes a new connection from step 1.
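
The server side of steps 2 to 4 can be sketched as a plain Python function. This is not the code from the webchat project; fetch is a hypothetical callable that returns the list of pending messages, and the polling interval is an assumption.

```python
import time


def long_poll(fetch, timeout=20.0, interval=0.5):
    """Return messages as soon as fetch() yields any, or [] after timeout.

    fetch is a hypothetical callable returning a list of pending messages.
    """
    deadline = time.monotonic() + timeout
    while True:
        messages = fetch()
        if messages:
            return messages  # steps 2 and 3: there is something to send
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return []        # step 4: nothing arrived, the client reconnects
        # Sleep until the next check, but never past the deadline.
        time.sleep(min(interval, remaining))
```

Inside a view the return value would be serialised into the response. While the function is waiting it ties up whatever is handling the request, which is part of the overhead of this technique.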

This method brings a lot of overhead. On the plus side, it is supported reasonably well in all browsers that can make AJAX requests.

My implementation of long-polling is done in Django. It is focused on keeping the model clean, the JavaScript tight and the long-polling technique robust. I have tested it on IE 6, 7, 9 and 10 as well as Firefox and Chrome.

A running version can at the time of writing be found here:

http://wc.dhedegaard.dk/

The source can be found here:

https://bitbucket.org/dennishedegaard/webchat

I will most likely try to put it up in the cloud (i.e. GAE, which supports Django 1.4 these days), to make sure it stays up.

Getting CentOS 6 to play nice with a serial port

Serial ports might seem like ancient technology nowadays. I virtualize everything on my server, and for that I use KVM together with libvirt. This means I usually use virt-manager for managing my VMs, which is nice when you have a Linux environment on the same LAN as the server. However, if you’re doing it over the net it takes a long time to do anything (especially in the VNC client in virt-manager). A serial console is a much lighter alternative.

Most of my new VMs these days are CentOS 6 machines. To enable them to send output to tty0 as well as ttyS0, do the following (in /boot/grub/grub.conf):

Add the following to make GRUB output to both tty0 and ttyS0:

serial --unit=0 --speed=19200
terminal --timeout=8 console serial

To tell the kernel to send its output to the serial port, append the following to the kernel parameters:

console=tty0 console=ttyS0,19200n8

In CentOS 6, when the last console statement is to a ttyS, CentOS automatically spawns a getty on the serial port (as explained in /etc/init/serial.conf):

# On boot, a udev helper examines /dev/console. If a serial console is the
# primary console (last console on the commandline in grub),  the event
# 'fedora.serial-console-available <port name> <speed>' is emitted, which
# triggers this script. It waits for the runlevel to finish, ensures
# the proper port is in /etc/securetty, and starts the getty.

In Debian I’ve been doing this for years by adding/changing the following in /etc/default/grub (and running update-grub afterwards):

GRUB_TERMINAL=serial
GRUB_SERIAL_COMMAND="serial --speed=9600 --unit=0 --word=8 --parity=no --stop=1"

GRUB_CMDLINE_LINUX_DEFAULT="quiet console=tty0 console=ttyS0,9600n8"

And uncommenting ttyS0 in /etc/inittab:

T0:23:respawn:/sbin/getty -L ttyS0 9600 vt100

Slashdot Quotes database

At the bottom of the Slashdot website there is a quote that changes from time to time (according to my tests, every hour). I’ve collected a database of the quotes, which is currently at 2381 entries.

I’ve implemented a nice graphical interface for searching the database in Django. The database runs on PostgreSQL.

The site can be found here: http://sd.dhedegaard.dk/

I’ve also implemented a REST-like interface that responds to a GET request on /json/random (for random quotes) and /json/latest (for the latest quotes). Both URLs take a count parameter, which is currently capped at 200 entries returned per request.
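
Capping the count parameter could be done along these lines. This is a sketch, not the site’s actual code: the helper name, the default of 1 and the lower bound of 1 are assumptions, only the cap of 200 comes from the description above.

```python
MAX_COUNT = 200  # the cap mentioned above


def parse_count(raw, default=1):
    """Turn the raw count value into an int between 1 and MAX_COUNT."""
    try:
        count = int(raw)
    except (TypeError, ValueError):
        # Missing or non-numeric input falls back to the assumed default.
        return default
    return max(1, min(count, MAX_COUNT))
```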

Game of Life

I’ve spent the day implementing Conway’s Game of Life in JavaScript and HTML5 using the canvas element.

Here’s an example of how it looks:

It’s been tested on the following browsers:

  • Google Chrome
  • Firefox
  • Internet Explorer 9

I’ve implemented different interesting patterns as well as a “random” feature. Suggestions and bugfixes are welcome.
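
For reference, the rule behind the simulation fits in a few lines. The sketch below is in Python rather than the JavaScript of the actual implementation, and it assumes an unbounded grid stored as a set of live (x, y) cells, which is not necessarily how the canvas version stores its state.

```python
from collections import Counter


def step(live):
    """Advance one generation; live is a set of (x, y) cells."""
    # Count, for every cell next to a live cell, how many live neighbours it has.
    neighbour_counts = Counter(
        (x + dx, y + dy)
        for x, y in live
        for dx in (-1, 0, 1)
        for dy in (-1, 0, 1)
        if (dx, dy) != (0, 0)
    )
    # A cell is alive next turn with exactly 3 neighbours,
    # or with exactly 2 if it is already alive.
    return {
        cell
        for cell, count in neighbour_counts.items()
        if count == 3 or (count == 2 and cell in live)
    }


# The "blinker" pattern flips between a horizontal and a vertical bar.
blinker = {(0, 1), (1, 1), (2, 1)}
```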

Feel free to browse the source: https://bitbucket.org/dennishedegaard/gameoflife.js

You can try it out here: http://p.dhedegaard.dk/gameoflife

Snake game in HTML5

I spent my Sunday implementing a snake game in JavaScript, with the drawing done in an HTML5 canvas element. I’ve tested it in Firefox 3.6 on lucid as well as Chrome 15 and Firefox 8.0.1 on Mint Isadora.

Any information on whether it works in Internet Explorer is appreciated, since I do not have the opportunity to test it myself :)

Here’s the link: http://p.dhedegaard.dk/snake/

Feel free to look in the source and find all my bugs :)