Release Notes and Instructions for updated EDTK and Berkeley DB driver.
By Chris Newcombe <chris.newcombe@gmail.com>


All of the work on the new version of EDTK was motivated by the need
for a complete 'production quality' driver for Berkeley DB.  Both were
developed together.  This document describes the changes.

[Obligatory disclaimer: all opinions and statements in the
documentation and code are my personal opinions and statements --
i.e. are not necessarily those of my employer.]

Please also read these:

   TODO                      (some important details)
   doc/
     EDTK_BerkeleyDB.ppt     (some rationale and detail)
   examples/berkeley_db/
     berkeley_db_api_support_status.txt

First, many thanks indeed to Scott Lystig Fritchie for writing the
original EDTK and berkeley_db driver -- it has saved me a huge 
amount of time.


Now a ***SERIOUS WARNING*** (ignore at your peril)

If you are not familiar with using Berkeley DB then you should read
its documentation very carefully before using the berkeley_db driver.

   Product home: http://www.oracle.com/database/berkeley-db/index.html
   Docs:         http://www.oracle.com/technology/documentation/berkeley-db/db/index.html

Also, the various public Berkeley DB discussion forums are an
excellent place to ask questions about API usage and application
design:

   Main forum:                 http://forums.oracle.com/forums/forum.jspa?forumID=271
   HA (replication) forum:     http://forums.oracle.com/forums/forum.jspa?forumID=272
   comp.databases.berkeley-db: http://groups.google.com/groups/dir?hl=en&lr=&ie=UTF-8&oe=UTF-8&q=comp.databases.berkeley-db

Berkeley DB is a flexible toolkit for building fast, robust
datastores. It has a very convenient and powerful API.  But that API
is exposed as procedural calls to library functions (i.e. not as a
declarative language like SQL).  And almost all of those library
functions have critical pre-conditions which must be enforced by the
application.

When correctly used, Berkeley DB is *extremely* robust and safe (its
high code quality makes it safe to use the berkeley_db driver in
'linked-in' mode -- i.e. loaded into the Erlang VM's own OS process --
which is very important for performance).  And Berkeley DB does in
fact catch & report many types of erroneous use (i.e. application
bugs).

But to maximize performance (as performance is one of its main
features), Berkeley DB does NOT check or prevent *all* types of
erroneous use.

Certain classes of application bug (e.g. closing a database or a
cursor while it is in use by a transaction) can lead to undefined
behaviour such as bus-errors (segv's) **and even irrevocable
data-corruption** (even when using replication -- i.e. the corruption
may be propagated to all replicas).

[Note: running the Erlang/BDB driver in "spawned program" mode -- aka
"pipe-mode" -- will only protect the Erlang VM from crashing if
Berkeley DB crashes.  But it will NOT protect your data (stored in
Berkeley DB) from being corrupted due to incorrect use of the Berkeley
DB API.  The only way to avoid that is to use Berkeley DB correctly --
follow the documentation carefully, and ask for clarification if it is
not clear, and make backups of your data (Berkeley DB supports hot
backups of live databases).]

Of course, this lack of resilience to some kinds of application bug is
clearly at odds with an important motivation for using Erlang --
building reliable systems in the presence of bugs.

However, the above warnings most often apply to explicit, direct,
incorrect use of the Berkeley API -- i.e. sins of commission that are
easy to avoid once you have thoroughly read the documentation (and
appreciate the danger).

Most importantly, the driver (and helper layers, mentioned below)
*does support* the standard (and vital) Erlang practice of 'crashing'
(exiting) by default if something unexpected occurs.  Great pains have
been taken in both EDTK and the driver code to ensure that it is safe
to use Erlang as it was intended when using Berkeley DB (i.e. that
appropriate cleanup happens after a process crash, and that all
Berkeley DB preconditions and invariants remain satisfied during that
cleanup).

Also, the driver now includes several 'helper layers' (e.g. the
port_coordinator and replication_group server) which add significant
extra layers of convenience and protection (e.g. the port_coordinator
coordinates startup and shutdown of BDB, and owns and manages database
handles, ensuring that those handles will not be closed if a
transaction is running).

But even these layers don't guarantee to catch all potential misuse
(e.g. explicitly committing or aborting a transaction that still has open
cursors).  You still need to know what you are doing when using the
API.

BDB also has some particularly complex features (e.g. distributed
transactions and replication) that can have very unpleasant failure
modes if you are not careful (inc. deadlocked applications, loss of
data, or unrolling of previously committed transactions).  So before
using those features it is essential that you read the relevant docs.

If you are unsure about anything, there is usually a working example
in the test suite:

   examples/berkeley_db/berkeley_db_test.erl

[There are instructions for running the tests at the top of that
file.]


Status/Quality Of This Release

This code is pretty much functionally complete -- all major Berkeley
DB APIs are now supported; see 

   examples/berkeley_db/berkeley_db_api_support_status.txt

Some APIs have been adjusted (e.g. berkeley_db_replication_group_helpers.erl)
and various bugs have been fixed (including in Berkeley DB itself 
-- all via official Oracle patches).

The goal is to make this code rock-solid and sufficiently performant
for use in high-end mission-critical systems.  

There is an extensive functional/regression-test suite, and the BDB
driver is now used in some real applications, some of which have 
had several months of intensive testing. So the driver appears to be stable.

But as always, Caveat Emptor.  Do your own testing, with your own use-cases,
and please report any bugs that you see.

Important: BDB exposes a LOT of tuning parameters, many of which have 
such dramatic effects on latency and throughput that they can be 
'destabilizing' from an application's point of view (e.g. cause timeouts) 
if they are set inappropriately.
The parameters often need to be tuned to achieve particular application 
goals (levels of concurrency, latency, throughput) on specific hardware.
Tuning the parameters takes familiarity with the BDB documentation and 
often quite a lot of experimentation. To make experimentation as easy 
and painless as possible, the driver exposes all BDB tuning parameters 
in a logical and convenient way.  I strongly recommend that you become 
familiar with these parameters.

Sidebar: EDTK now has a lot of mostly-independent features, so the
combinatorics of testing are pretty heavy.
For example:

  - The berkeley_db driver requires use of private threadpools (to
    avoid various types of deadlocks).  So I haven't tried compiling a
    driver for which "default_async_calls=0", or for which
    "default_async_calls=1" but "private_async_threadpool=false", for
    quite a while.  Given the problems with the native Erlang
    threadpool (e.g. just the danger of getting in the way of standard
    drivers like efile_drv), I'd recommend that most drivers use the
    private threadpool feature.

  - The berkeley_db driver uses 'shared' (thread-safe) valmaps for
    DB_ENV and DB, and 'nested' (parent/child relationship) valmaps
    for DB_TXN.  Both of those EDTK features can be combined, but not
    with BDB, so that combination has not been tested (even compiled)
    -- although I believe I did write all of the necessary code for
    that combination.


Note on the other 'example' drivers.

  Only the Berkeley DB driver has been compiled and tested with this
  new release of EDTK. 

  The other 'example' drivers that shipped with EDTK v1.1 have **NOT**
  been upgraded to this version.  Indeed, I have never even compiled
  those drivers.  Therefore those other drivers may now be slightly
  broken.  It shouldn't take too much work to get them working again
  because almost all of the new features in EDTK can be turned off,
  and the internal APIs are largely unchanged.  If the other
  drivers are upgraded, then they can immediately take advantage of
  new EDTK features such as private threadpools, shared (thread-safe)
  valmaps, parent/child valmaps etc.

  If the target library has similar requirements to Berkeley DB
  (e.g. high cost to creating a port instance, need for coordinated
  startup/shutdown, advantage in sharing expensive but thread-safe
  resources) then it would be appropriate to write a port-coordinator
  for that library -- adapting the BDB port-coordinator as a starting
  point.  (There is a case for making the port-coordinator into a
  generated file -- i.e. a gslgen template -- because the code to
  forward 'pipe mode' messages to the original sender of the request
  is really part of EDTK infrastructure (essentially part of the
  implementation of the receive_reply function in erl_template.gsl)
  and should be managed as such.  Almost all of the rest of the code
  in berkeley_db_port_coordinator.erl is BDB-specific.)


Requirements and Supported Platforms

- Operating System

  I have only tested this on Red Hat Enterprise Linux v3 (the only 
  unix I have access to).
   
  I believe it should work unchanged on more recent versions of linux,
  and other unix platforms that support posix threads.  
   
  The original EDTK v1.1 code was probably not compatible with Windows
  without some changes (due to use of pthreads), and that remains true
  for this version -- in fact probably more so, due to the way
  threadpools are implemented (under Windows the argument to
  driver_select must be an event handle, not a pipe handle).  I think
  that Windows support should be fairly straightforward, but I won't
  have time (or need) to do it myself in the foreseeable future.


- Erlang Version

  EDTK now requires Erlang R11B-3 or later, as the ErlDrvBinary
  structure changed due to support for SMP.  EDTK supports both smp
  mode and non-smp mode.  The BDB driver automatically uses some
  optimizations in smp mode as some erl_driver functions are now
  thread-safe. However, in testing the non-smp driver seems to
  slightly out-perform the smp driver, presumably due to lock
  contention.

  Intensive testing has been done with R11B-3 in non-smp mode.

  Significant testing was done with R11B-3 in smp mode. No problems
  were found.

  Some testing (e.g. regression tests and application tests but not
  stress/performance testing) has been done with R11B-4 in both 
  non-smp and smp modes. No problems were found.

  **IMPORTANT MAINTENANCE ISSUE**
  edtk/erl_driver_pipelib.h now contains some structures and macros copy/pasted 
  from private Erlang VM code.  These are required to make 'pipe mode' work.
  These structures are required to allow pipe_main.c to pretend 
  (to the generated driver shared-library) that it is the Erlang VM.
  If these structures change in future Erlang releases, then edtk/erl_driver_pipelib.h 
  must be updated.   
  See also the long comments about driver_*_binary() APIs in edtk/erl_driver_pipelib.c 
  and this mailing-list thread:
      http://www.erlang.org/pipermail/erlang-questions/2006-October/023500.html


- Template language

  The driver still requires GSLgen from iMatix.
  It uses the same version that EDTK v1.1 uses.
  If you don't already have it, then the README file (Scott's original) 
  contains instructions.  
  I seem to remember that it was a little fiddly, so here are my notes
  that I recorded at the time.

    cd /home/$USER/erlang/downloads/edtk

    Download   http://www.imatix.com/pub/sfl/src/sflsrc21.tgz
    mkdir sfl; cd sfl
    gunzip -c ../sflsrc21.tgz | tar -xvf -
    chmod a+rx c build
    export PATH=$PATH:/home/$USER/erlang/downloads/edtk/sfl
    ./build
     
    Download   http://www.imatix.com/pub/tools/gslsrc20.zip
    cd ..; mkdir gslgen; cd gslgen
    unzip -a ../gslsrc20.zip
    cd src
    chmod a+rx c build
    cp ../../sfl/*.h .
    cp ../../sfl/*.a .
    ./build

    If you choose a different directory, edit the GSLGEN_EXE variable
    in examples/berkeley_db/Makefile accordingly.


- The BDB driver now requires the following support libraries.

  [If you install them anywhere other than /usr/local you will need 
  to alter paths in examples/berkeley_db/Makefile and 
  examples/berkeley_db/releases/make-release.sh]

    cd ~/erlang/downloads/edtk

    Download from http://www.pcre.org/

    tar -zvxf pcre-6.7.tar.gz
    cd pcre-6.7
    ./configure
    make 2>&1 | tee make.out
    sudo make install 2>&1 | tee make-install.out

  and

    cd ~/erlang/downloads/edtk

    Download from https://sourceforge.net/projects/goog-coredumper/

    tar -zvxf coredumper-0.2.tar.gz
    cd coredumper-0.2
    ./configure
    make 2>&1 | tee make.out
    sudo make install 2>&1 | tee make-install.out


- The BDB driver now requires BDB v4.5.20, which can be downloaded from:

    http://www.oracle.com/technology/software/products/berkeley-db/index.html

  IMPORTANT: this release of the driver also requires all of the patches in 

    examples/berkeley_db/patches-to-berkeley-db/db-4.5.20

  There is a script to apply these in the correct order -- see below.

  Most of the patches are 'official' patches provided by
  Sleepycat/Oracle, and will be incorporated into later public
  releases of BDB.

  Patches should be applied in the root of the unpacked BDB tree, 
  before configuring and building BDB or EDTK/berkeley.
  The patches should all apply cleanly; i.e. no 'hunk offset' warnings 
  from the patch utility.

  e.g. (adjust paths as necessary)

    cd /home/$USER/erlang/downloads/edtk
    tar -zxvf db-4.5.20.tar.gz
    cd db-4.5.20
    ~/erlang/downloads/edtk/edtk-1.5/examples/berkeley_db/patches-to-berkeley-db/db-4.5.20/apply-patches.sh
    cd build_unix
    ../dist/configure --enable-debug --prefix=/home/$USER/erlang/downloads/edtk/BerkeleyDB.4.5
    make
    make install

  **IMPORTANT** 
  Your operating system very likely ships with an older version of Berkeley DB.
  Mixing different 'default' installations of Berkeley DB on the same host 
  can be very confusing, as the contents of <db.h> change between versions, and
  that header may be compiled into existing programs.
  Therefore I use the --prefix argument to 'configure' to pick a custom 
  installation directory.  (The BDB library will be packaged as part of the 
  berkeley_db application anyway.)
  
  If you do this but pick a different directory then you may need to
  edit berkeley_db/Makefile to set BDB_INSTALL_DIR to the directory
  that you chose.

  Then build EDTK and the berkeley_db driver, and run the tests:

    cd /home/$USER/erlang/downloads/edtk/edtk-1.5/examples/berkeley_db
    pushd ../../edtk; make clean; make; popd; make clean; make
    rm regression.out; make regression 2>&1 | tee regression.out

  **IMPORTANT MAINTENANCE ISSUE**
  Future releases of BDB require that constants in berkeley_db.xml be adjusted.
  Diff the generated (installed) db.h file from BDB 4.5 with the db.h file from 
  the target BDB release to see what must be changed.


Architecture Overview
 
First, please read Scott's original README file which has pointers to
the original EDTK documentation, including his excellent paper.

The following overview begins with a brief, very high-level ('manager
level') summary of how Erlang applications are structured, and how we
want them to interface to BDB. It repeats/summarizes a little of the
overview in Scott's paper.

The later part of the overview explains the new 'helper layers' of the
BDB driver that were not present in EDTK v1.1.  Everyone using the BDB
driver should read that part.

Quick Introduction:

The Erlang VM runs its own scheduler for Erlang processes.  An Erlang
process is similar to a user-level thread but it is totally isolated
-- it can only communicate with other processes via message-passing;
i.e.  no shared memory or mutexes.  An Erlang application typically
consists of hundreds (perhaps thousands or even millions) of these
lightweight processes.  The idea is to have one process for every
concurrent thing in the domain of your application (the 'active
object' model).  The Erlang VM uses either a single OS thread, or a
single OS thread per cpu, to run schedulers for Erlang processes.
Erlang processes are scheduled preemptively.  They can 'crash' without
disturbing other processes (due to no use of mutexes/shared memory).
Processes can also monitor each other and get messages or asynchronous
signals if another process crashes or exits.
 
Erlang interfaces to the outside world via 'port' objects which are
simply message-channels.  To use an external library you have to write
code to make the library look like Erlang processes -- it can only
communicate by sending and receiving messages, and must not block the
Erlang VM process/thread.  Once a library has such a message-passing
interface, it can be used by 'linking it in' to the Erlang VM as a
shared object, or spawning another program (another OS process) and
communicating with it over a pipe (over the spawned process' stdin and
stdout).

However, it's not quite as simple as just creating a message-passing
interface. Another important part of Erlang is the high degree of
isolation between Erlang processes -- e.g. lack of
locking/mutexes. But some libraries expose APIs that may block for a
long time -- even indefinitely.  e.g. Berkeley DB implements a
page-oriented database, and includes a locking subsystem to control
access to those pages.  Therefore, if multiple Erlang processes use
BDB concurrently (which we definitely want), those Erlang processes
can acquire locks, and block against locks held by other Erlang
processes.  This applies to a Berkeley DB application written in any
language (it's not a property of EDTK or the driver.)

Also, Berkeley DB obviously does a lot of disk IO, but it does not
support asynchronous IO because it is a highly portable library and
there is no standard async IO API across operating systems.  For the
same reason, Berkeley DB does not even create any OS threads itself
(sidebar: one exception to the latter rule is the new 'replication
manager' convenience layer).

Therefore, Berkeley DB will often block the application's own OS
threads -- when blocking against a logical lock, or when doing IO (the
latter applies to many kinds of libraries of course, and the new
features in EDTK will help when writing drivers for such libraries).

Of course it is critical that the Erlang VM's scheduler threads are
never blocked by BDB. The standard, supported way to achieve this is
using the Erlang 'async thread pool' ('erl +A' and the driver_async()
API).  Unfortunately that mechanism is not flexible enough for the
Berkeley DB driver (and using it would risk interfering with other
important drivers like efile_drv). So EDTK now implements private
threadpools, and multiplexes commands across those pools.  Each driver
instance (port) has its own set of threads, and the pools are
resizeable at runtime. (This support only requires the standard Erlang
VM and APIs -- it does not require any modifications to the Erlang
VM.)  The BDB driver currently uses 4 threadpools; see comments at the
top of berkeley_db.xml for details.

Note that when an Erlang process 'blocks' against Berkeley DB, it is
actually sitting in a receive statement (with a timeout), waiting for
a reply from the driver.  So to an application, blocking against a BDB
lock looks much the same as attempting to access a file on disk, or a
connection to a client/server database like MySQL or PostgreSQL, or
simply waiting for a reply from a gen_server:call.  But to get decent
performance (avoid unnecessary blocking), applications using BDB do
need to consider access patterns and lock contention.  Note that there
is one difference between waiting for the result of a BDB command and
waiting for the result of a normal Erlang gen_server:call -- BDB
commands are processed in threadpools of finite size, which therefore
may become backlogged or saturated.  There are various mechanisms in
place to handle that; e.g. the threadpools can be resized at runtime
(while in use), and applications can query the current length of the
queues (to decide whether to add more threads).  (See also the TODO
file for details of some subtle issues that can arise with threadpool
saturation.)
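
As a rough illustration of the 'sitting in a receive statement (with a
timeout)' pattern described above, here is a minimal sketch.  It does
not use the real driver: the module name, the message shapes, and the
use of make_ref() as the request tag are all assumptions made for the
example, and a plain echo process stands in for the port.

```erlang
-module(tagged_call_sketch).
-export([start_echo/0, call/3, demo/0]).

%% Hypothetical stand-in for the driver: it replies directly to whoever
%% sent the request, echoing the tag so the caller can match the reply
%% to its request.
start_echo() ->
    spawn(fun echo_loop/0).

echo_loop() ->
    receive
        {request, From, Tag, Payload} ->
            From ! {reply, Tag, {echo, Payload}},
            echo_loop();
        stop ->
            ok
    end.

%% Send a tagged request, then wait in a receive (with a timeout) until
%% the reply carrying the same tag arrives.
call(Server, Payload, TimeoutMs) ->
    Tag = make_ref(),
    Server ! {request, self(), Tag, Payload},
    receive
        {reply, Tag, Result} ->
            Result
    after TimeoutMs ->
        exit({timeout, Payload})
    end.

demo() ->
    Server = start_echo(),
    {echo, hello} = call(Server, hello, 1000),
    Server ! stop,
    ok.
```

From the caller's point of view this looks the same whether the reply
is delayed by a lock, by disk IO, or by a backlogged threadpool; only
the timeout distinguishes 'slow' from 'stuck'.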

The role of EDTK:

Much of the code in port drivers is tedious boilerplate.  EDTK is a
code-generation tool that takes a declaration of the target library's
API (e.g. see examples/berkeley_db/berkeley_db.xml) and some template
files containing boilerplate, and produces the Erlang and C code that
implements the message-passing interface for the library.  It
generates the following code:

  Erlang code 

    - Provides Erlang 'stub' functions that map to library APIs.

    - The stub functions marshal arguments into a message.  (A new
      EDTK feature is that we now also pass a tag that contains a
      unique identifier that can be used to associate a reply with its
      request, and the sender of the request.)

    - Send message to the port

    - Wait for reply. (It is now possible to generate non-blocking
      variants of APIs -- these return the request 'tag', which can be
      used to collect the reply later.)

    - Unmarshal the reply message and return it to the caller.

    - The previous release of EDTK would always return results to the
      caller as {ok, Value} or {error, Value}.  This release gives the
      option (on a per-driver basis) of throwing errors as exceptions
      by default (and returning just the Result, rather than {ok,
      Result}, in the success case).  See this paper for the rationale
      of using throw vs {error, Reason}:

        www.erlang.se/workshop/2004/exception.pdf
        www.erlang.se/euc/04/carlsson_slides.pdf 

      Note: the BDB driver now does this (and this is a significant API
      change from the berkeley_db driver in EDTK v1.1).

      See usage examples in:

        examples/berkeley_db/berkeley_db_test.erl
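
    The difference between the two return conventions can be sketched
    as follows.  This is only an illustration: the module and function
    names are hypothetical, a key lookup in a list stands in for a
    driver call, and the exact term thrown by the real driver is not
    shown here (see the test suite for real usage).

```erlang
-module(result_style_sketch).
-export([lookup_tuple/2, lookup_throw/2, demo/0]).

%% Old style: every result is wrapped, so every caller must match on
%% {ok, Value} or {error, Reason}.
lookup_tuple(Key, Pairs) ->
    case lists:keyfind(Key, 1, Pairs) of
        {Key, Value} -> {ok, Value};
        false        -> {error, not_found}
    end.

%% New style: return the bare value and throw on error.  The thrown
%% term used here is an assumption made for this sketch.
lookup_throw(Key, Pairs) ->
    case lookup_tuple(Key, Pairs) of
        {ok, Value}     -> Value;
        {error, Reason} -> throw({error, Reason})
    end.

demo() ->
    Pairs = [{a, 1}, {b, 2}],
    {ok, 1} = lookup_tuple(a, Pairs),
    1 = lookup_throw(a, Pairs),
    not_found = try lookup_throw(z, Pairs)
                catch throw:{error, R} -> R
                end,
    ok.
```

    With the throwing style, code on the success path can use results
    directly, and a process that does not expect an error simply
    crashes and is cleaned up.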

  C code

    - Unmarshals command arguments from the message

    - Create 'command' objects and enqueue them on the
      producer/consumer FIFO queue for the appropriate threadpool.

    - Implement the threadpools -- i.e. consume items from the queues,
      call the BDB API with the correct arguments, and enqueue the
      response in an internal queue for the Erlang VM process/thread
      to collect (and then tell the Erlang VM that the internal queue
      contains something, by writing to a pipe fd).

    - Marshals the reply into a message

    - Sends the reply back to the Erlang VM.  The message is sent
      directly to the process that sent the original request.  This is
      a significant difference between this release of EDTK and
      previous releases, because it enables multiple Erlang processes
      to share a single port instance (previous releases of EDTK
      required that each Erlang process opened its own port instance).
      However, this feature greatly complicates the implementation
      details of EDTK, because of the need to do proper cleanup if an
      Erlang process crashes while owning BDB resources like
      transactions and cursors. In previous releases that process'
      port instance would have been closed automatically, and the
      driver's C stop() function would have done the cleanup.  But in
      this release the single shared port instance cannot be closed
      just because one of the processes using it happened to crash.
      See next point (and 

    - Tracks resource usage by the Erlang application; these are
      stored in arrays called 'valmaps'.  e.g. The BDB driver keeps tables
      of open DB_ENV, DB, DBC, and DB_TXN handles.  If the port driver
      is closed unexpectedly (e.g. the Erlang process that owns it
      crashes) then the shutdown C code closes cursors, aborts
      transactions, closes database handles, closes the environment
      handle, all in the correct order (even if some are currently in
      use when the shutdown begins).

      Important: In this release, resources (valmap table entries) are
      tagged as belonging to a specific Erlang process -- any process
      that wants to use BDB.  A command is provided to cleanup just
      the subset of resources owned by a specific Erlang process
      (should it crash).  But note that this is entirely hidden from
      application code in normal use (i.e. you don't have to worry
      about it).  See the description of the new 'port coordinator' 
      server later.

    - Checks that constraints are met; e.g. transactions can only be
      used by one command at a time, a prepared transaction cannot be
      automatically aborted if a process crashes (it must be resolved
      by a GTM), etc. etc.  Basically it implements most (but not all)
      of the pre-conditions/constraints described in the BDB
      documentation.

    - Converts asynchronous events from Berkeley DB (the new
      DB_ENV->set_event_notify mechanism in BDB v4.5) to an Erlang
      message and sends them to Erlang (to the port-coordinator
      mentioned later).

    Plus about a zillion other details.
 
Using The BDB Driver:

The BDB driver can be packaged as an Erlang/OTP 'application', using 
the (very basic) script:

     examples/berkeley_db/releases/make_release.sh

The resulting directory tree is an OTP 'library application'
(including a minimal .app file), like STDLIB.  i.e. It is just a
collection of code modules; it doesn't have a start/2 function for use
by application:start().

An Erlang system that uses Berkeley DB will consist of one or more
Erlang nodes, each of which loads the berkeley_db application created
with the above script, and then opens one (or rarely, more) BDB
environments via one of the following two methods:

- If replication is not being used then the application starts an
  instance of berkeley_db_port_coordinator per BDB environment.

- If replication is being used, the application starts an instance of
  berkeley_db_replication_group (which in turn starts a
  port-coordinator process, and supervises it), per BDB environment.

Both of the above processes are standard gen_servers.  

Importantly, they both have start_link functions, and can participate
in standard OTP supervisor trees.

**API note** In general, all of the BDB driver functions conform to
'modern' Erlang thinking and return a plain Result or throw an
exception on error. i.e. They don't return {ok, Result} or {error,
Reason}.  However there are exceptions to that rule -- e.g. all
start_link functions do still return {ok, Pid} or {error, Reason} as
OTP supervisors require that.

Please read the API documentation at the top of the respective .erl files.
What follows here is just a summary:


berkeley_db_port_coordinator.erl:

- A port_coordinator owns the single port instance that represents and
  communicates with a single BDB environment on the local machine.

  A single Erlang VM can run as many port_coordinators (and therefore
  as many different BDB environments) as it likes.  Most applications
  will only need to use one.  

  IMPORTANT: It is *CRITICAL* that only a single port-coordinator be
  configured to talk to a given BDB environment at any given time.  If
  two port-coordinators open the same BDB environment then irrevocable
  data-corruption is almost certain.  This may be made safe by use of
  BDB's DB_REGISTER flag (see BDB documentation).

  The port_coordinator does the following:

  - Opens the port driver for an environment (which creates a single
    DB_ENV handle for that environment).

  - Provide a very flexible & homogeneous configuration facility for
    almost all BDB config APIs and parameters, including sensible
    defaults.  (This is quite an important convenience, as Berkeley DB
    has a *lot* of configuration parameters, scattered across a wide
    number of APIs. It makes it possible to configure an entire BDB
    environment by passing a single Erlang term (lists of tagged
    tuples) to the port-coordinator at startup.)

  - Calls DB_ENV->open() to open the BDB environment.  Typically it
    also runs Berkeley DB recovery, to ensure that the database is in
    a consistent state after any earlier crash.  BDB recovery must be
    coordinated (e.g. single-threaded), so the port-coordinator does
    this synchronously, before it returns from init().

  - If distributed transactions are being used, then it runs
    DB_ENV->txn_recover and if necessary contacts a
    GlobalTransactionManager to resolve any unresolved txns that it
    finds during recovery.  [Important: The BDB API makes this a
    blocking operation -- the port-coordinator cannot continue unless
    the GTM replies.  Hence to avoid serious availability issues, the
    GTM should be replicated. Work is in progress on that.]  The
    port-coordinator also does 'incremental recovery' of unresolved
    transactions while the application is running (e.g. if a process
    calls txn_prepare on its local part of a distributed transaction,
    but then crashes before it commits or aborts the transaction, then
    the port-coordinator takes ownership of that unresolved
    transaction, and asks the GTM for a decision whether to abort or
    commit it).  Important: distributed transactions should not be
    used without knowing what you are doing (e.g. all such
    transactions must use lock or txn timeouts to avoid distributed
    deadlock).  So caveat emptor.

  - If replication has been enabled then it calls DB_ENV->repmgr_start,
    and repmgr_add_remote_site etc, to join a replication group.
    (There is a lot more to replication than this; for instance, the
    election of a 'master' site (only the master can do write
    operations).)  Before using replication, read the BDB replication
    documentation.

  - Allow client processes to 'register' to use the BDB environment.
    This means that the port_coordinator tracks the life of the client
    process, and will cleanup after that process if it crashes --
    close any open cursors, abort open transactions, etc.

  - Hand out shared DB handles to clients on demand (one handle per
    database, shared by all clients).  The port_coordinator creates
    the database on first use (i.e. specifies DB_CREATE to db_open)

  - Receives all BDB event notifications via the mechanism described
    below, and publishes them to an arbitrary set of interested
    (registered) processes via a gen_event handler.  This set of
    registered processes can be entirely independent from the set of
    registered clients that have asked to use the environment.

  - Receives custom events from the C code in the driver.  e.g. If an
    attempt is made to automatically 'clean up' a transaction that has
    been prepared but not yet explicitly committed or aborted (e.g. one
    process participating in the distributed transaction crashed), we
    cannot unilaterally abort that txn because other sites might
    already have been told to commit.  Instead of aborting the txn,
    the C code sends an event (and a reference to the txn handle) to
    the port_coordinator, which asks the GTM to resolve the txn -- and
    the port_coordinator then commits or aborts the local txn.
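
  The 'register' and cleanup-after-crash behaviour in the list above
  can be sketched with a plain process monitor.  This is not the
  port_coordinator's actual code: the module name, message shapes and
  function names are hypothetical, and the real cleanup work (closing
  cursors, aborting transactions) is reduced to dropping the client
  from a list.

```erlang
-module(client_cleanup_sketch).
-export([start/0, register_client/2, registered/1, demo/0]).

start() ->
    spawn(fun() -> loop([]) end).

%% Clients is a list of {MonitorRef, Pid} for every registered client.
loop(Clients) ->
    receive
        {register, From, Tag, Pid} ->
            MRef = erlang:monitor(process, Pid),
            From ! {Tag, ok},
            loop([{MRef, Pid} | Clients]);
        {registered, From, Tag} ->
            From ! {Tag, [P || {_, P} <- Clients]},
            loop(Clients);
        {'DOWN', MRef, process, Pid, _Reason} ->
            %% A real coordinator would release the dead client's
            %% resources here; this sketch only forgets the client.
            loop(lists:delete({MRef, Pid}, Clients));
        stop ->
            ok
    end.

register_client(Coord, Pid) ->
    Tag = make_ref(),
    Coord ! {register, self(), Tag, Pid},
    receive {Tag, ok} -> ok after 1000 -> exit(timeout) end.

registered(Coord) ->
    Tag = make_ref(),
    Coord ! {registered, self(), Tag},
    receive {Tag, Pids} -> Pids after 1000 -> exit(timeout) end.

%% Poll until the coordinator has processed the 'DOWN' message.
wait_until_empty(_Coord, 0) ->
    exit(cleanup_not_seen);
wait_until_empty(Coord, N) ->
    case registered(Coord) of
        [] -> [];
        _  -> timer:sleep(10), wait_until_empty(Coord, N - 1)
    end.

demo() ->
    Coord = start(),
    Client = spawn(fun() -> receive crash -> exit(boom) end end),
    ok = register_client(Coord, Client),
    [Client] = registered(Coord),
    Client ! crash,
    [] = wait_until_empty(Coord, 50),
    Coord ! stop,
    ok.
```

  The point of the pattern is that the client does nothing special:
  it can crash at any time and the coordinator learns of it
  asynchronously.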


berkeley_db_helpers.erl:

  This contains convenience functions for using BDB -- e.g. a do_txn()
  function that takes an entire transaction (packaged as a {Module,
  Function, Args} tuple), and executes it against a given BDB
  environment (i.e. it takes a port-coordinator Pid or name).
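
  The general shape of such a helper can be sketched as below.  This
  is not the real do_txn(): the begin/commit/abort operations are
  passed in as funs (stand-ins for driver calls), and passing the
  transaction handle as the first argument of the {Module, Function,
  Args} call is an assumption made for the example.  Read the
  documentation at the top of berkeley_db_helpers.erl for the actual
  interface.

```erlang
-module(do_txn_sketch).
-export([do_txn/2, demo/0, ok_body/3, bad_body/1]).

%% Ops is a map of three funs standing in for driver calls:
%%   begin_txn :: fun(() -> Txn)
%%   commit    :: fun((Txn) -> ok)
%%   abort     :: fun((Txn) -> ok)
do_txn({M, F, Args}, #{begin_txn := Begin, commit := Commit, abort := Abort}) ->
    Txn = Begin(),
    try apply(M, F, [Txn | Args]) of
        Result ->
            ok = Commit(Txn),
            Result
    catch
        Class:Reason:Stack ->
            %% The body failed: abort, then re-raise the original error.
            ok = Abort(Txn),
            erlang:raise(Class, Reason, Stack)
    end.

%% Example transaction bodies (hypothetical).
ok_body(_Txn, A, B) -> A + B.
bad_body(_Txn) -> throw(oops).

demo() ->
    Self = self(),
    Ops = #{begin_txn => fun() -> txn1 end,
            commit    => fun(T) -> Self ! {committed, T}, ok end,
            abort     => fun(T) -> Self ! {aborted, T}, ok end},
    3 = do_txn({?MODULE, ok_body, [1, 2]}, Ops),
    receive {committed, txn1} -> ok after 0 -> exit(no_commit) end,
    oops = try do_txn({?MODULE, bad_body, []}, Ops)
           catch throw:R -> R
           end,
    receive {aborted, txn1} -> ok after 0 -> exit(no_abort) end,
    ok.
```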


berkeley_db_sequence_server.erl

  This is a (useful) example of a BDB application.  It is a gen_server
  that implements a classic database 'sequence'.  That is, a series of
  (in this case 64-bit) integers that do not repeat, even if the
  server crashes (i.e. consumption of integers is persistent).  The
  integers are suitable for use as message-ids, primary keys, etc.  Note
  that the sequence is NOT guaranteed to be purely sequential -- there
  can be gaps (possibly very large gaps), and higher integers may be
  returned before lower integers (although this is rare).  The only
  guarantee is that the same integer will not be returned twice.
  Obtaining an integer is typically very fast as the server holds a
  'cached range' of integers, and only needs to do a BDB transaction
  when that range is exhausted.  The size of the range can be set when
  the server is started.  Specifying a large range reduces the
  frequency of transactions, but results in larger gaps in the
  sequence (irrevocably wasted integers) when the server is restarted
  (either normally or after a crash), as each restart must begin
  consuming from the *next* cached range (as it is not known how many
  integers from the last range were actually consumed).  However, with
  a 64-bit total range there is no lack of integers -- ranges of size
  1,000 or more are typically fine.  
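
  The cached-range scheme can be modelled in a few lines.  This is a
  language-neutral sketch written in Python, not the code of the
  gen_server: the class name, the dict standing in for the BDB
  database, and the key 'next_range_start' are all assumptions made
  for illustration.

```python
class SequenceServer:
    """Hands out non-repeating integers from a cached range."""

    def __init__(self, store, range_size=1000):
        # 'store' stands in for the BDB database (hypothetical): one
        # entry records the start of the first range not yet reserved.
        self.store = store
        self.range_size = range_size
        self.next_value = 0
        self.range_end = 0   # empty range: first call must reserve one

    def _reserve_range(self):
        # In a real server this step would be one BDB transaction.
        start = self.store.get("next_range_start", 0)
        self.store["next_range_start"] = start + self.range_size
        self.next_value = start
        self.range_end = start + self.range_size

    def next_id(self):
        if self.next_value >= self.range_end:
            self._reserve_range()
        value = self.next_value
        self.next_value += 1
        return value


store = {}                                       # survives 'restarts'
server = SequenceServer(store, range_size=1000)
first = [server.next_id() for _ in range(3)]     # uses 0, 1 and 2

server = SequenceServer(store, range_size=1000)  # simulated restart
after_restart = server.next_id()                 # starts the next range
```

  After the simulated restart the first value comes from the next
  range, so the unused remainder of the previous range is skipped.
  That is the gap described above, and its size is bounded by the
  range size.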

  Note that a recent version of Berkeley DB added an almost identical
  feature (see the documentation for DB_SEQUENCE).  However, the BDB
  driver does not implement those BDB APIs because all calls to BDB
  must incur overhead in marshalling/unmarshalling, enqueuing to
  threadpools etc, and most calls to sequence_server only increment 
  and return an integer. i.e. It's faster to keep the cached-range 
  on the Erlang side of the port boundary, and do explicit transactions 
  when a new range needs to be started.
  
  In the near future this application will be changed (or a variant
  created) to support replication, so that if a host/disk dies
  permanently, the current value of the sequence is not lost.


berkeley_db_global_transaction_server.erl:

  This is part of the support for distributed transactions.  It
  provides (on demand) the 128-byte GUIDs required for distributed
  transactions (see the BDB txn_prepare API).  It also acts as an
  authoritative repository for the state of all distributed
  transactions (i.e. whether they have committed or aborted), which is
  queried if a participant in a distributed transaction fails after
  preparing its local part of the transaction (when the participant
  recovers it asks the GTM for the state of the transaction, in order
  to commit or abort it locally).

  In the near future this application will be changed (or a variant
  created) to support replication.  This is vital for reliable
  use of distributed transactions, as applications can block 
  permanently if the GTM is not available or loses data.


berkeley_db_msg_queue.erl:

  This is a sketch of an example use-case for the BDB driver.  It
  implements an asynchronous transactional message queue with
  guaranteed exactly-once delivery.  It is designed to give reliability
  guarantees similar to distributed transactions but avoid the
  availability issues of 2-phase-commit.  In particular, the
  destination may be unreachable but senders can still transactionally
  enqueue messages.

  The idea is that some 'sender worker' process (bit of business
  logic) performs a BDB transaction to do some work, and wants to notify
  another process (presumably on another node) that the work has been
  done (and perhaps pass some arbitrary payload with the message).  So
  the sender uses functions in this module to enqueue a message in a
  Berkeley DB queue database as part of the normal work transaction.
  That's all the 'sender' has to do -- i.e. there is no 'send message'
  API, just an enqueue API.
  
  A separate 'send pump' process transactionally consumes messages
  from the local message queue, sends them to a destination 'receive
  pump' process, waits for an ack, and then commits the consume
  transaction.  (If a consume transaction is aborted the message 
  automatically reappears in the queue, ready for a retry.)

  The 'receive pump' process (on another host/BDB environment,
  otherwise there is no point) receives messages, transactionally
  enqueues and commits them to a local queue database, and then
  replies with an ack to the 'send pump'.  (Note that if this ack is
  lost the 'send pump' will abort its consume transaction and try to
  send the message again, so to guarantee exactly-once delivery all
  messages have a guaranteed-unique id (an instance of sequence_server
  is used for this) and the receive pump records the ids of all
  messages that it has committed to its local queue -- and simply
  discards and acks (again) any resends of messages that it has
  already accepted.)
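
  The duplicate-suppression rule on the receive side can be sketched
  as follows.  This is an illustrative Python model, not the module's
  code: the class, the in-memory set and list, and the shape of the
  ack are assumptions.  In particular, recording the id and enqueuing
  the payload are presumed here to happen in one transaction.

```python
class ReceivePump:
    def __init__(self):
        self.seen_ids = set()   # ids of messages already committed
        self.local_queue = []   # stands in for the local queue database

    def handle(self, msg_id, payload):
        """Accept a message from the send pump; always reply with an ack."""
        if msg_id not in self.seen_ids:
            # Assumption: these two updates share one transaction, so
            # a crash cannot leave one applied without the other.
            self.local_queue.append(payload)
            self.seen_ids.add(msg_id)
        # A resend (the earlier ack was lost) is discarded, but it is
        # acked again so that the send pump can commit its consume.
        return ("ack", msg_id)


pump = ReceivePump()
ack1 = pump.handle(42, "work done")
ack2 = pump.handle(42, "work done")   # resend after a lost ack
```

  Both calls return the same ack, but the payload is enqueued only
  once, which is what turns at-least-once sending into exactly-once
  delivery.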

  Now a 'receiver worker' process (another piece of business logic) on
  the same host as the 'receive pump' process can transactionally
  consume messages from the receive queue, process them, and if the
  work transaction succeeds, commit the consume transaction.

  One implementation detail: Berkeley DB queue databases have
  fixed-size records, but we want to support arbitrary payloads.  So
  if a message won't fit into the configured maximum queue record
  length, part of it is stored in a btree database.  The two databases
  are manipulated in the same (nested) transactions, so they will
  always remain consistent.

  This code has not yet been 'productized' or even fully tested.  In
  particular it needs a better routing layer (currently it uses
  'global' process registration of the receive pump process, as an
  experiment -- see later for why this is not a good idea).  So treat it
  as an educational example -- i.e. caveat emptor.

  Also, this subsystem is intended to run with replication, so no
  messages can ever be lost even if a host or disk fails.  But the
  implementation does not yet support that.

  There is a usage-example/test of this code in berkeley_db_test.erl.


berkeley_db_replication_group.erl:
 
  The replication_group processes are responsible for tracking the
  state of all sites in a replication group, including the location of
  the master.

  A replication_group server does the following (see source file 
  for more details):

  - Spawns and supervises a local port_coordinator and registers as a
    subscriber to its event channel.

  - Attempts to find all other sites in the replication group (and
    retries if they are down).  These other sites might be on other
    Erlang VMs (on this host or other hosts).

  - Optionally it can create the other sites if they are down (at
    startup or if they crash for any reason).  So an entire
    replication-group across multiple nodes/hosts can be started by a
    single API call.  Note that the sites will all attempt to 'peer
    supervise' each other -- i.e. restart each other if they crash.
    This is an experimental feature.  There is nothing wrong with
    starting each site in a replication_group via a local supervision
    tree on each node (in fact that is the recommended approach).

  - Listens for and tracks status updates from all sites.

  - Implements a simple state machine that converts sleepycat events
    from all sites to absolute states.

  - (Important convenience).  All sites are willing to accept
    'do_txn_on_master' calls, and (if the local site is not the
    master) will transparently forward user transaction functions to
    the current master site.

    For this API, transactions are 'packaged' as {Module, Function,
    Args} tuples.  Note that funs are NOT used due to various issues
    with code upgrade (we want to be able to upgrade code across a
    replication group on a per-site basis, and funs are bound to the
    hash of their module; {Module, Function, Args} tuples do not have
    that problem).

  - All sites are willing to accept 'do_txn_on_any_site' calls
    (i.e. read-operations).  If the local site is 'in-sync' with 
    the master (or if no master is running) then the local site 
    will perform the transaction.  Otherwise the transaction will 
    be sent to a site that is in-sync.

  - Handles adding/removing sites to the group.  (This is not yet
    fully implemented -- it requires BDB features which will hopefully
    be released in Berkeley DB v4.6 in 2007).

  - Coordinates clean shutdown of all sites in the group (when the
    application shuts down).  This involves suspending writes and
    waiting for all sites to fall into sync.  (This is not yet
    implemented -- again it needs features in Berkeley DB v4.6.)

  Approach To Distributed System Communication

  When configuring a replication_server, each site needs to know about
  all of the others in the group.  Sites exchange replication data
  (handled by BDB internals), and coordination messages at the Erlang
  level (including tracking which site is master).

  Note that Berkeley DB's replication layer makes its own connections
  -- replication traffic does *not* pass through Erlang processes.
  The BDB APIs for this require TCP/IP addresses, of the form 
  {HostName, Port} tuples.

  But the Erlang-level coordination (tracking of master, forwarding
  'transaction functions' to be executed on the master and then
  returning the results) uses Distributed Erlang.

  However, the code restricts itself to a 'safe subset' of Distributed
  Erlang features. In particular, BDB replication is designed to work
  correctly even if the network is partitioned temporarily.  Some
  Distributed Erlang features do not work/recover well in the presence
  of network partitioning.

  The code does use the following features of Distributed Erlang:

  - the basic connection setup/heartbeat/teardown/retry stuff 
    (named nodes, epmd, net_kernel etc)

  - 'remote pids', including link/monitor of pids on remote nodes

  In particular, the code makes frequent use of *local* process
  registration and the

       {ProcName, NodeName} 

  form of addressing remote-processes (e.g. to gen_server:call() et al).
  We also use remote-pids as 'cached' forms of these addresses.

  The code does *not* use the following features of Distributed Erlang
  as they are rumoured or known to have problems under network
  partitioning:

  - dist_ac
  - global
  - pg, pg2
  - mnesia

  Another potential issue with using Distributed Erlang is that it
  only uses direct routing, and by default it attempts to maintain a
  fully-connected network (N^2 connections).  This imposes an ultimate
  limit on scalability.  net_kernel can be set to do lazy on-demand
  connections rather than transitive proactive connections, but unless
  application communication patterns are constrained, it will
  eventually create a fully-connected network.  The largest
  Distributed Erlang system I've heard of (from a mailing list post)
  is 80 nodes. It seems quite feasible to have low-hundreds, but
  possibly not thousands until the epoll patch is made official.

  The replication_server code is designed to run with relatively small
  groups (3 to maybe 20 or 30 nodes).  It has only been tested with up
  to 5 nodes.

  One option is to use Distributed Erlang within each replication
  group/cluster, and some other protocol between clusters.


berkeley_db_replication_group_helpers.erl:

  The fact that replication groups don't use any kind of global
  process/site registration service invites the question, how does
  application code find a site in order to run a transaction?  This is
  a particularly relevant question given that the whole point of
  replication groups is to be highly available even if some percentage
  of their sites are down or unreachable.

  First note that some applications don't have this problem because 
  they use only a single replication_group, which has a site on 
  every node in the system.  Therefore the application just attempts 
  to use the site on its local node, as if the application is
  running then it's highly likely that the replication site will 
  be running (or will be soon, if it is running recovery).
  The local site will forward any write transactions to the 
  master, and will perform any read transactions locally, so 
  long as the local site is 'in sync' with the master (processing 
  live updates, not doing bulk-recovery, aka 'bootstrapping').

  But if a system does not have a replication site on every node,
  (e.g. if the system is a set of replication-clusters), then we 
  need to be able to find and use 'remote' replication sites.
  
  It is assumed that the entire user application (i.e. every node)
  *does* know the names and nodes of all the replication sites (this
  information is required to start each replication_group server, so
  it must be globally available somehow).

  So all we really need to access 'remote' replication sites are some
  helper functions:

    - to remember the latest 'master location hint' (to optimize
      routing of write transactions)

    - to load-balance read transactions across all sites that are
      believed to be reachable

    - to remember which sites, if any, we have found to be unreachable
      (so we don't try to send requests to them for a while)

    - to periodically try to contact any unreachable sites, to see if
      they have recovered yet

  As replication groups normally exist in a steady state without
  failures (most/all sites running, and a stable master), we simply
  need to discover that state, remember what it is (as an
  optimization), and re-discover it fairly quickly after a change
  (e.g. if the master becomes unreachable we need to stop trying to
  connect to that site for a while, and also find the location of the
  new master).

  The functions in this module accomplish this.  They manipulate an
  opaque object called a caller_state (record).  The definition of
  this record is available in a header file in case it is useful, but
  most of the time it should be possible to treat it as an opaque
  token.  The idea is that each node stores the caller_state for each
  replication group somewhere -- an ETS table is perfect.  Then when a
  process wants to do a transaction on a replication group, it simply
  looks up the caller_state for that group, and passes it to one of
  the helper functions in this module (do_txn_on_master or
  do_txn_on_any_site), along with the {Module, Function, Args} tuple
  that it wants to execute.
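
  To make the four duties listed above concrete, here is a small
  model of what a caller_state might track.  It is a Python sketch
  under stated assumptions, not the record defined in the header
  file: the field names, the round-robin policy, the fixed back-off
  interval and the explicit 'now' argument are all hypothetical.

```python
import itertools


class CallerState:
    def __init__(self, sites, retry_after=30.0):
        self.sites = list(sites)
        self.master_hint = None       # latest 'master location hint'
        self.unreachable = {}         # site -> time it may be retried
        self.retry_after = retry_after
        self._round_robin = itertools.count()

    def mark_unreachable(self, site, now):
        self.unreachable[site] = now + self.retry_after
        if self.master_hint == site:
            self.master_hint = None   # the master must be rediscovered

    def candidates(self, now):
        # Sites believed reachable, plus any whose back-off has expired
        # (which is how unreachable sites get retried periodically).
        return [s for s in self.sites
                if self.unreachable.get(s, 0.0) <= now]

    def pick_read_site(self, now):
        # Round-robin load balancing of read transactions.
        live = self.candidates(now)
        if not live:
            return None
        return live[next(self._round_robin) % len(live)]

    def pick_write_site(self, now):
        # Prefer the remembered master; otherwise use any live site,
        # which will forward the transaction to the master.
        if self.master_hint in self.candidates(now):
            return self.master_hint
        return self.pick_read_site(now)


state = CallerState(["a@h1", "b@h2", "c@h3"], retry_after=30.0)
state.master_hint = "b@h2"
w1 = state.pick_write_site(now=0.0)
state.mark_unreachable("b@h2", now=0.0)
w2 = state.pick_write_site(now=1.0)
reads = [state.pick_read_site(now=1.0) for _ in range(2)]
later = state.candidates(now=31.0)
```

  The sketch mutates its state in place for brevity.  An Erlang
  version would presumably return an updated caller_state from each
  call, which the caller stores back (e.g. in the ETS table).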

  In the candidate 1.5 releases this module had not been fully tested.
  It has now been tested, and the API should be stable.

  Also, support has been added for distributed transactions against
  the masters of separate replication groups (see
  do_distributed_txn_unordered).


List of Changes since EDTK v1.1 (might not be totally complete --
there were a lot of changes).

Please see doc/EDTK_BerkeleyDB.ppt for some design rationale and
explanation.

The changes can be categorized as follows:

  - Enhancements to EDTK
  - Enhancements to BDB driver
  - Bug fixes to EDTK
  - Bug fixes to BDB driver


Enhancements to EDTK framework (not driver-specific)

- The previous release of EDTK used the Erlang VM's 'async IO
  threadpool' for Berkeley DB operations, but that has several 
  critical problems.

  - Long-running BDB operations would block IO by other important
    drivers (like efile_drv which provides access to the file-system).
    e.g. Some BDB operations may take a very long time
    (e.g. txn_checkpoint, memp_trickle, db_compact), or may block a
    thread indefinitely (e.g. db_get with the DB_CONSUME_WAIT option).

  - The native Erlang VM's threadpool schedules jobs on a round-robin
    basis, or by using a hash of an application-specified key.  The
    round robin mechanism would violate BDB preconditions on thread
    use (if it did not deadlock first), and the hash mechanism would
    require a dedicated thread for each *possible* concurrent BDB
    transaction (so hundreds of threads, most of which would be idle),
    and would still not be safe from deadlock (as some BDB operations 
    use private internal transactions, so the application does not 
    have a 'key' to hash to a thread id). 

  - The native Erlang threadpool is sized at runtime (with 'erl +A'),
    but we want more flexibility than that.

  This is fixed in this release by allowing each port instance to
  create multiple private worker thread pools, dedicated to specific classes of
  operations.  Each threadpool is fed by a producer-consumer queue 
  (unlike the native Erlang VM threadpool in which each thread has its own 
  queue, so some threads may be busy and have work stuck in their queues 
  while other threads sit idle).
  The threadpools may be resized at runtime (threads added/removed), 
  the stack size for threads may be specified (to reduce virtual memory 
  usage with large numbers of threads -- Berkeley DB only needs about 
  32KB to 64KB of stack), and the application can place limits on the
  lengths of the threadpool queues, to reject new commands under 
  saturation.  (The length limits can be changed at runtime.)
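
  The difference between one shared queue and a queue per thread, and
  the effect of a queue length limit, can be seen in a small model.
  This is a Python sketch of the general technique only; the real
  pools are implemented in C inside the driver, and the class and
  method names here are assumptions.  Runtime resizing and stack-size
  control are not shown.

```python
import queue
import threading


class WorkerPool:
    def __init__(self, num_threads, max_queue_len):
        # One queue shared by every worker, so an idle thread always
        # takes the next job; no job waits behind a busy thread.
        self.jobs = queue.Queue(maxsize=max_queue_len)
        for _ in range(num_threads):
            threading.Thread(target=self._worker, daemon=True).start()

    def _worker(self):
        while True:
            func, args = self.jobs.get()
            try:
                func(*args)
            finally:
                self.jobs.task_done()

    def submit(self, func, *args):
        """Enqueue a job; return False if the queue is saturated."""
        try:
            self.jobs.put_nowait((func, args))
            return True
        except queue.Full:
            return False

    def wait(self):
        self.jobs.join()   # block until every queued job has finished


results = []
pool = WorkerPool(num_threads=2, max_queue_len=100)
for n in range(5):
    pool.submit(results.append, n * n)
pool.wait()

# With no workers draining it, a queue limited to 2 rejects the third job.
saturated = WorkerPool(num_threads=0, max_queue_len=2)
accepted = [saturated.submit(print, n) for n in range(3)]
```

  Rejecting at submit time gives the caller an immediate error under
  saturation instead of an unbounded backlog.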

- Added support for shared-access valmaps.

  Some libraries may produce resources (valmaps) that are thread-safe,
  and can safely be used by multiple Erlang processes at once.
  (e.g. Berkeley DB DB_ENV and DB handles can be made thread-safe).
  So a valmap type may now be declared as 'shared', and appropriate
  reference-counting will be done to manage automatic cleanup.  Also,
  some operations on shared valmaps may need exclusive access
  (e.g. any 'stop' operation like txn commit/abort, or cursor close).
  Such operations do test-for and obtain exclusive-access.  This
  feature is an important enabler for the BDB driver, as DB_ENV and DB
  handles are documented as being slow to open (and expensive in terms
  of memory and other resources).  So for large numbers of Erlang
  processes to use BDB together, it is essential to share these
  handles.  Also many BDB features (e.g. replication) are more
  difficult to use if more than a single DB_ENV handle is open on an
  environment at a time.

- Added support for parent/child relationships between valmap slots
  (*of the same type*).  The relationship forces correct ordering of
  cleanup operations. i.e. If a valmap with any children is 'stopped'
  then all of its children (recursively) are automatically 'stopped'
  first.  This feature was necessary to support BDB nested
  transactions correctly.
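
  The cleanup ordering amounts to a post-order walk of the
  parent/child tree.  The following Python sketch illustrates that
  ordering only; the Slot class, the stop() function and the log list
  are hypothetical stand-ins for the C valmap code.

```python
class Slot:
    def __init__(self, name, parent=None):
        self.name = name
        self.children = []
        self.stopped = False
        if parent is not None:
            parent.children.append(self)


def stop(slot, log):
    """Stop a slot, stopping all of its descendants first."""
    if slot.stopped:
        return
    for child in slot.children:
        stop(child, log)
    slot.stopped = True
    log.append(slot.name)   # stands in for e.g. aborting the txn


outer = Slot("outer_txn")
inner = Slot("inner_txn", parent=outer)
innermost = Slot("innermost_txn", parent=inner)
sibling = Slot("sibling_txn", parent=outer)

order = []
stop(outer, order)
```

  Stopping the outermost transaction stops the innermost one first,
  so no child is ever cleaned up after its parent.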

- Added EDTK support for converting numeric library parameters and
  return codes to/from atoms.  

  Berkeley DB has a _lot_ of public #define constants of the form
  DB_xxx which are used as flags, enums, and/or return codes.  These
  constants can and frequently do change numeric value between releases
  of BDB.  That's fine for a C application (it just needs to recompile
  to pick up the new constants), but a distributed Erlang application
  must deal with different sites running different versions of BDB
  (during a site-by-site upgrade), and it is critical that the
  meaning of the parameter/return code values not be
  misinterpreted. So now the Erlang stubs accept atoms or lists of
  atoms for these parameters, and convert them to the correct numeric
  constants at the last possible moment before calling the C code.
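
  The idea can be shown with two imaginary library versions.  In this
  Python sketch the numeric values are invented for illustration (the
  real DB_xxx values come from the db.h of the installed BDB
  release), and the function names are assumptions rather than the
  names of the generated stubs.

```python
# Invented values for two imaginary versions of the library.
FLAGS_V1 = {"db_create": 0x01, "db_thread": 0x04}
FLAGS_V2 = {"db_create": 0x01, "db_thread": 0x20}
RETURN_CODES_V1 = {-30990: "db_notfound"}
RETURN_CODES_V2 = {-30988: "db_notfound"}


def encode_flags(names, flags):
    """OR together the numeric values for a list of symbolic flags."""
    value = 0
    for name in names:
        value |= flags[name]   # KeyError for an unknown flag
    return value


def decode_return(code, return_codes):
    """Map a numeric return code to its symbolic name, if it has one."""
    return return_codes.get(code, code)


wanted = ["db_create", "db_thread"]
v1 = encode_flags(wanted, FLAGS_V1)
v2 = encode_flags(wanted, FLAGS_V2)
r1 = decode_return(-30990, RETURN_CODES_V1)
r2 = decode_return(-30988, RETURN_CODES_V2)
```

  The application only ever sees the symbolic names, so the same
  request and the same result mean the same thing on both sites even
  though the numbers passed to the C library differ.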

- Added EDTK support for pattern matching on selected return values
  from the C driver, and taking arbitrary actions (e.g. throwing, 
  exiting, transforming the result before returning it).

- Added EDTK support for throwing all {error, Reason} return values as
  Erlang exceptions by default, and pattern-matching selected return
  values to be returned as {error, Reason} tuples.  The Berkeley DB
  API now uses this form of interface, which makes application code
  *much* more convenient.  The previous EDTK release probably did not
  have this feature because Ericsson only recently added greatly
  improved exception handling to Erlang (try/catch/after, rather than
  the old catch construct).

- Added EDTK support for returning large complex structures from the
  driver using the 'Erlang external term' format (the 'ei' and
  'erl_interface' C libraries provided by Ericsson).  This is used in
  supporting the BDB 'statistics' APIs.

- Improved logging when debugging EDTK and/or driver code.  e.g. Each
  port instance may now have a textual 'label', to distinguish log
  messages from different instances running on the same node.  (The
  BDB port-coordinator sets this label to the registered name of the
  port-coordinator process.)


Enhancements to BDB driver

- EDTK v1.1 did not support some essential parts of the Berkeley DB
  API, largely due to safety/correctness issues arising from the lack
  of flexible threadpools.

  Support for the following has been added:

      - transactions, including nested transactions
      - distributed transactions, and recovery (including a GlobalTransactionManager)
      - important 'housekeeping' functions such as txn_checkpoint, log_archive, memp_trickle etc.
      - all 'statistics' APIs   : these return large structures full of useful internal data
      - replication             : a large, complex topic on its own
      - db_compact              : 'defragmentation' of live btree databases
      - db_get(DB_CONSUME_WAIT) : blocking consume-from-DB_QUEUE-database

  See examples/berkeley_db/berkeley_db_api_support_status.txt for a summary by API.

- One of Erlang's main strengths is that it is reasonable/normal to
  have thousands of Erlang processes running on a single node.  And of
  course we want a lot of Erlang processes to be able to
  simultaneously use the C libraries that are wrapped by EDTK.

  The previous release of EDTK had an implicit (but totally
  reasonable) assumption: that concurrent access to a C library from
  Erlang would be achieved by opening a separate port (a separate
  instance of the driver/library) from each Erlang process that wants
  to use that library.

  It's hard to argue with that model -- it's clean and simple, and most
  importantly, the Erlang VM guarantees that all ports opened by a
  process will be automatically closed if that process exits.  So
  (with appropriate C code in the driver's stop() function), cleanup
  is automatic.

  Unfortunately that model is not a good fit for Berkeley DB, for
  important practical reasons.

  - BDB requires coordinated (single-threaded) startup and shutdown.
    Also there are housekeeping jobs like txn_checkpoint, log_archive
    etc. that should be run per environment (and must be coordinated).

  - Opening BDB environment and database handles is slow.  Fortunately
    these handles can support shared-access (they are thread-safe),
    and the BDB docs strongly advise applications to share them to
    achieve good performance.  But transient worker processes
    would be opening and closing DbEnv and Db handles per transaction
    -- i.e. performance would be dire.

  - Even if opening a separate DB_ENV handle per port was fast, many
    BDB features work best with only a single DB_ENV handle
    (e.g. replication).

  - The Erlang VM places a hard limit on the total number of ports (of
    any kind) that can be opened simultaneously.

  - We probably want to limit the number of processes using BDB (to
    avoid thrash), but in the one-process-one-port model there is no
    way to limit the degree of concurrent access (number of open port
    instances) other than running into the VM's hard-limit (which
    would starve the system of other kinds of ports).

  All of those problems are fixed in this release by allowing multiple
  Erlang processes to share one instance of a port.

  The sharing of a port is accomplished by a few changes:

  - A new module called 'berkeley_db_port_coordinator.erl' that
    coordinates startup, shutdown, sharing of DB handles, incremental
    resolution of distributed transactions, and other things.

  - A small change to linked-in mode to use driver_caller and
    driver_send_term instead of driver_output_term, to allow the reply
    to be sent to the process that sent the command, rather than
    always to the port's connected process.

  - All commands to an EDTK-generated driver now carry a unique tag
    (currently term_to_binary({Ref,SendPid})) to precisely associate
    replies with their commands.  In pipe mode (when all replies
    arrive at the port's 'connected process' -- i.e. the port
    coordinator), we have enough information to forward the reply to
    the process that sent the command.

    The tag mechanism also allows asynchronous (non-blocking) variants
    of commands to be generated.  i.e. It is possible to generate a
    function to send the command (and return the generated tag), and
    another function to do a blocking receive for the reply to that
    command (the function requires a tag of course).  This can be
    useful in some circumstances -- e.g. servers that want to use
    Berkeley DB but don't want to spawn worker processes for each
    operation.  The port-coordinator uses this mechanism for some
    internal operations (e.g. during shutdown).  Currently only a few
    BDB API functions have non-blocking variants -- simply to avoid
    code-bloat (most applications won't need them).
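
  The split between 'send and return a tag' and 'receive the reply
  for a tag' can be modelled as below.  This is a toy Python sketch,
  not the generated stubs: the Port class, the integer tags and the
  canned replies are assumptions, and the toy port completes every
  command immediately whereas a real receive would block.

```python
import itertools


class Port:
    """Toy port: replies arrive in one mailbox, tagged by command."""

    def __init__(self):
        self._tags = itertools.count(1)
        self._mailbox = []            # (tag, reply) pairs

    def send_command(self, command, arg):
        # Non-blocking half: issue the command and return its tag.
        tag = next(self._tags)
        self._mailbox.append((tag, (command, "ok", arg)))
        return tag

    def receive_reply(self, tag):
        # Selective receive: take only the reply carrying this tag,
        # leaving replies to other commands in the mailbox.
        for i, (reply_tag, reply) in enumerate(self._mailbox):
            if reply_tag == tag:
                del self._mailbox[i]
                return reply
        raise LookupError("no reply for tag %r" % (tag,))


port = Port()
t1 = port.send_command("db_put", "k1")
t2 = port.send_command("db_get", "k2")
r2 = port.receive_reply(t2)   # replies may be collected in any order
r1 = port.receive_reply(t1)
```

  Because each reply is matched by tag rather than by arrival order,
  a caller can have several commands outstanding at once.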

- dbenv_close and dbenv_remove are now flagged as global-exclusive
  operations.  These are not allowed if any valmaps are in-use at all.


Bug fixes to EDTK

- EDTK's 'valmap' system for managing handles had a known, dangerous
  weakness: Erlang applications may accidentally use expired valmap
  entries (i.e. references to transactions that have been
  committed/aborted), which will crash the Erlang VM.  This has been
  fixed with per-slot generation-ids.  Each valmap slot has a 32-bit
  counter. Each 'valmap record' returned to Erlang contains the value
  of that counter. Any valmap 'stop' ('free') function increments the
  value of the counter.  So if a valmap record is used after it is
  'stopped' (e.g. a db_txn handle is used after it is committed or
  aborted), the counter value shows that the valmap record has
  expired, and an error is thrown.
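
  The generation-id check can be sketched in a few lines.  This
  Python model illustrates the technique only; the real valmap code
  is C, and the class, the (index, generation) tuple and the
  exception name are assumptions.

```python
class ValmapError(Exception):
    pass


class Valmap:
    def __init__(self, size):
        self.values = [None] * size      # None marks a free slot
        self.generation = [0] * size     # one counter per slot

    def start(self, value):
        index = self.values.index(None)  # first free slot
        self.values[index] = value
        return (index, self.generation[index])   # the 'valmap record'

    def lookup(self, record):
        index, generation = record
        if (generation != self.generation[index]
                or self.values[index] is None):
            raise ValmapError("expired valmap record")
        return self.values[index]

    def stop(self, record):
        self.lookup(record)              # rejects a stale record
        index, _ = record
        self.values[index] = None
        # Wrap around like a 32-bit counter.
        self.generation[index] = (self.generation[index] + 1) & 0xFFFFFFFF


valmap = Valmap(size=4)
txn = valmap.start("txn-A")
valmap.stop(txn)
reused = valmap.start("txn-B")           # same slot, new generation
try:
    valmap.lookup(txn)
    stale_detected = False
except ValmapError:
    stale_detected = True
```

  Without the counter, the stale record would silently resolve to
  whatever now occupies the reused slot.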

- The value returned by the wrapped library was leaked if no valmap
  slots are free after the call. The solution was to reserve the slot
  when the command is received.

- Driver shutdown didn't acquire a critical mutex, but operations may
  still be in-flight (race condition/undefined behavior).

- Driver shutdown didn't wait for in-progress operations involving
  the first valmap type to finish before it starts closing valmaps of
  the second type (and so on).

- Driver shutdown did not call cleanup_.._index() for any valmap
  entries that are INUSE, so the code in cleanup_.._index() that sets
  DELAYED_CLEANUP is never activated, so some valmaps (that are in-use
  when stop is called) are never cleaned up.  (DELAYED_CLEANUP would
  self-deadlock if it was ever used, as cleanup_index() was called
  while desc->mutex was held, and cleanup_index tried to acquire the
  (non-recursive) mutex again).

- Driver shutdown could consume infinite stack if it was not safe to
  return (valmaps were still in use).

- Cleanup_index held mutex while calling the valmap cleanup function
  (e.g. txn_abort, for BDB) - which may take a long time.

- Several bugs in 'pipe mode' prevented Berkeley DB from being shut down cleanly.

- The DELAYED_CLEANUP feature would self-deadlock if used, as it re-acquired a non-recursive mutex.

- If a successful valmap 'start' (allocation) operation is performed
  after driver shutdown has begun then the object would be leaked (the
  async_free call is currently just sys_free (to free the callstate
  object), so nothing currently calls the valmap cleanup_func on the
  result of the library call made by invoke -- i.e. an object returned
  from the library (bdb) is leaked).


Bug fixes to BDB driver

- The EDTK v1.1 examples/berkeley_db driver installed allocation
  functions that called driver_alloc_binary in async workers (from
  within BDB code).  This resulted in memory corruption and segfaults
  under concurrent load.
  Amusingly, this practice actually became legal in Erlang R11B
  but only when SMP support is enabled.  I have a TODO to add this 
  back into the driver (it is a potentially important optimization, 
  as it avoids otherwise redundant allocations and copying of data).

  See this thread on the Erlang mailing list for related details:
  http://www.erlang.org/pipermail/erlang-questions/2006-October/023500.html

- db_rename, db_remove, db_upgrade were not tagged as valmap "stop"
  operations, so the Db handle was left in the valmap array, and would
  crash during _stop().

- Removed support for env_set_errpfx as it made BDB refer to memory
  that had already been released (undefined behaviour).

- Data returned by db_get, c_get etc would be leaked if driver
  shutdown was started before they completed.

- To avoid potential deadlocks of the Erlang VM, BDB's auto-deadlock
  detection feature is now explicitly enabled during driver shutdown,
  as the application may have been calling the explicit deadlock
  detector frequently, but it can no longer do so once shutdown has
  begun, and operations might still be in progress that could deadlock


Support

I intend to support and enhance this library as time permits.
Please send questions, suggestions, bug reports, and patches
to chris.newcombe@gmail.com


End of file.