Release Notes and Instructions for updated EDTK and Berkeley DB driver. By Chris Newcombe All of the work on the new version of EDTK was motivated by the need for a complete 'production quality' driver for Berkeley DB. Both were developed together. This document describes the changes. [Obligatory disclaimer: all opinions and statements in the documentation and code are my personal opinions and statements -- i.e. are not necessarily those of my employer.] Please also read these: TODO (some important details) doc/ EDTK_BerkeleyDB.ppt (some rationale and detail) examples/berkeley_db/ berkeley_db_api_support_status.txt First, many thanks indeed to Scott Lystig Fritchie for writing the original EDTK and berkeley_db driver -- it has saved me a huge amount of time. Now a ***SERIOUS WARNING*** (ignore at your peril) If you are not familiar with using Berkeley DB then you should read its documentation very carefully before using the berkeley_db driver. Product home: http://www.oracle.com/database/berkeley-db/index.html Docs: http://www.oracle.com/technology/documentation/berkeley-db/db/index.html Also, the various public Berkeley DB discussion forums are an excellent place to ask questions about API usage and application design: Main forum: http://forums.oracle.com/forums/forum.jspa?forumID=271 HA (replication) forum: http://forums.oracle.com/forums/forum.jspa?forumID=272 comp.databases.berkeley-db: http://groups.google.com/groups/dir?hl=en&lr=&ie=UTF-8&oe=UTF-8&q=comp.databases.berkeley-db Berkeley DB is a flexible toolkit for building fast, robust datastores. It has a very convenient and powerful API. But that API is exposed as procedural calls to library functions (i.e. not as a declarative language like SQL). And almost all of those library functions have critical pre-conditions which must be enforced by the application. When correctly used, Berkeley DB is *extremely* robust and safe (its high code quality makes it safe to use the berkeley_db driver in 'linked-in' mode -- i.e. loaded into the Erlang VM's own OS process -- which is very important for performance). And Berkeley DB does in fact catch & report many types of erroneous use (i.e. application bugs). But to maximize performance (as performance is one of its main features), Berkeley DB does NOT check or prevent *all* types of erroneous use. Certain classes of application bug (e.g. closing a database or a cursor while it is in use by a transaction) can lead to undefined behaviour such as bus-errors (segv's) **and even irrevocable data-corruption** (even when using replication -- i.e. the corruption may be propagated to all replicas). [Note: running the Erlang/BDB driver in "spawned program" mode -- aka "pipe-mode" -- will only protect the Erlang VM from crashing if Berkeley DB crashes. But it will NOT protect your data (stored in Berkeley DB) from being corrupted due to incorrect use of the Berkeley DB API. The only way to avoid that is to use Berkeley DB correctly -- follow the documentation carefully, and ask for clarification if it is not clear, and make backups of your data (Berkeley DB supports hot backups of live databases).] Of course, this lack of reslience to some kinds of application bug is clearly at odds with an important motivation for using Erlang -- building reliable systems in the presence of bugs. However, the above warnings most often apply to explicit, direct, incorrect use of the Berkeley API -- i.e. sins of commission that are easy to avoid once you have thoroughly read the documentation (and appreciate the danger). Most importantly, the driver (and helper layers, mentioned below) *does support* the standard (and vital) Erlang practice of 'crashing' (exiting) by default if something unexpected occurs. Great pains have been taken in both EDTK and the driver code to ensure that it is safe to use Erlang as it was intended when using Berkeley DB (i.e. that appropriate cleanup happens after a process crash, and that all Berkeley DB preconditions and invariants remain satisfied during that cleanup). Also, the driver now includes several 'helper layers' (e.g. the port_coordinator and replication_group server) which add significant extra layers of convenience and protection (e.g. the port_coordinator coordinates startup and shutdown of BDB, and owns and manages database handles, ensuring that those handles will not be closed if a transaction is running). But even these layers don't guarantee to catch all potential mis-use (e.g. explicitly commiting or aborting transaction that still has open cursors). You still need to know what you are doing when using the API. BDB also has some particularly complex features; e.g. distributed transactions and replication that can have very unpleasant failure modes qif you are not careful (inc. deadlocked applications, loss of data, or unrolling of previously commited transactions. So before using those features it is essential that you read the relevant docs. If you are unsure about anything, there is usually a working example in the test suite examples/berkeley_db/berkeley_db_test.erl [There are instructions for running the tests at the top of that file.] Status/Quality Of This Release This code is pretty much functionally complete -- all major Berkeley DB APIs are now supported; see examples/berkeley_db/berkeley_db_api_support_status.txt Some APIs have adjusted (e.g. berkeley_db_replication_group_helpers.erl) and various bugs have been fixed (including in Berkeley DB itself -- all via official Oracle patches). The goal is to make this code rock-solid and sufficiently performant for use in high-end mission-critical systems. There is an extesive functional/regression-test suite, and the BDB driver is now used in some real applications, some of which have had several months of intensive testing. So the driver appears to be stable, But as always, Caveat Emptor. Do your own testing, with your own use-cases, and please report any bugs that you see. Important: BDB exposes a LOT of tuning parameters, many of which have such dramatic effects on latency and throughput that they can be 'destabilizing' from an application's point of view (e.g. cause timeouts) if they are set inappropriately. The parameters often need to be tuned to achieve particular application goals (levels of concurrency, latency, throughput) on specific hardware. Tuning the parameters takes familarity with the BDB documentation and often quite a lot of experimentation. To make experimentation as easy and painless as possible, the driver exposes all BDB tuning parameters in a logical and convenient way. I strongly recommend that you become familiar with these parameters. Sidebar: EDTK now has a lot of mostly-independent features, so the combinatorics of testing are pretty heavy. For example: - The berkeley_db driver requires use of private threadpools (to avoid various types of deadlocks). So I haven't tried compiling a driver for which "default_async_calls=0", for which "default_async_calls=1" but "private_async_threadpool=false", for quite a while. Given the problems with the native Erlang threadpool (e.g. just the danger of getting in the way of standard drivers like efile_drv), I'd recommend that most drivers use the private threadpool feature. - The berkeley_db driver uses 'shared' (thread-safe) valmaps for DB_ENV and DB, and 'nested' (parent/child relationship) valmaps for DB_TXN. Both of those EDTK features can be combined, but not with BDB, so that combination has not been tested (even compiled) -- although I believe I did write all of the necessary code for that combination. Note on the other 'example' drivers. Only the Berkeley DB driver has been compiled and tested with this new release of EDTK. The other 'example' drivers that shipped with EDTK v1.1 have **NOT** been upgraded to this version. Indeed, I have never even compiled those drivers. Therefore those other drivers may now be slightly broken. It shouldn't take too much work to get them working again because almost all of the new features in EDTK can be turned off, and the internal APIs are largely unchanged the same. If the other drivers are upgraded, then they can immediately take advantage of new EDTK features such as private threadpools, shared (thread-safe) valmaps, parent/child valmaps etc. If the target library has similar requirements to Berkeley DB (e.g. high cost to creating a port instance, need for coordinated startup/shutdown, advantage in sharing expensive but thread-safe resources) then it would be appropriate to write a port-coordinator for that library -- adapting the BDB port-coordinator as a starting point. (There is a case for making the port-coordinator into a generated file -- i.e. a gslgen template -- because the code to forward 'pipe mode' messages to the original sender of the request is really part of EDTK infrastructure (essentially part of the implementation of the receive_reply function in erl_template.gsl) and should be managed as such. Almost all of the rest of the code in berkeley_db_port_coordinator.erl is BDB-specific. Requirements and Supported Platforms - Operating System I have only tested this on Red Hat Enterprise Linux v3 (the only unix I have access to). I believe it should work unchanged on more recent versions of linux, and other unix platforms that support posix threads. The original EDTK v1.1 code was probably not compatible with Windows without some changes (due to use of pthreads), and that remains true for this version -- in fact probably more so, due to the way threadpools are implemented (under Windows the argument to driver_select must be an event handle, not a pipe handle). I think that windows support should be fairly straight-forward, but I won't have time (or need) to do it myself in the forseeable future. - Erlang Version EDTK now requires Erlang R11B-3 or later, as the ErlDrvBinary structure changed due to supporr for SMP. EDTK supports both smp mode and non-smp mode. The BDB driver automatically uses some optimizations in smp mode as some erl_driver functions are now thread-safe. However, in testing the non-smp driver seems to slightly out-perform the smp driver, presumably due to lock contention. Intensive testing has been done with R11B-3 in non-smp mode. Significant testing was done with R11B-3 in smp mode. Noo problems were found. Some testing (e.g. regression tests and application tests but not stress/performance testing) has been done with R11B-4 in both non-smp and smp modes. No problems were found. **IMPORTANT MAINTENANCE ISSUE** edtk/erl_driver_pipelib.h now contains some structures and macros copy/pasted from private Erlang VM code. These are required to make 'pipe mode' work. These structures are required to all allow pipe_main.c to pretend (to the generated driver shared-library) that it is the Erlang VM. If these structures change in future Erlang releases, then edtk/erl_driver_pipelib.h must be updated. See also the long comments about driver_*_binary() APIs in edtk/erl_driver_pipelib.c and this mailing-list thread: http://www.erlang.org/pipermail/erlang-questions/2006-October/023500.html - Template language The driver still requires GSLgen from iMatix. It uses the same version that EDTK v1.1 uses. If you don't already have it, then the README file (Scott's original) contains instructions. I seem to remember that it was a little fiddly, so here are my notes that I recorded at the time. cd /home/$USER/erlang/downloads/edtk Download http://www.imatix.com/pub/sfl/src/sflsrc21.tgz mkdir sfl; cd sfl gunzip -c ../sflsrc21.gz | tar -xvf - chmod a+rx c build export PATH=$PATH:/home/$USER/erlang/downloads/edtk/sfl ./build Download http://www.imatix.com/pub/tools/gslsrc20.zip mkdir gslgen; cd gslgen unzip -a ../gslsrc20.zip cd src chmod a+rx c build cp ../../sfl/*.h . cp ../../sfl/*.a . ./build If you choose a different directory, edit the GSLGEN_EXE variable in examples/berkeley_db/Makefile accordingly. - The BDB driver now requires the following support libraries. [If you install them anywhere other than /usr/local you will need to alter paths in examples/berkeley_db/Makefile and examples/berkeley_db/releases/make-release.sh] cd ~/erlang/downloads/edtk Download from http://www.pcre.org/ tar -zvxf pcre-6.7.tar.gz cd pcre-6.7 ./configure make 2>&1 | tee make.out sudo make install 2>&1 | tee make-install.out and cd ~/erlang/downloads/edtk Download from https://sourceforge.net/projects/goog-coredumper/ tar -zvxf coredumper-0.2.tar.gz cd coredumper-0.2 ./configure make 2>&1 | tee make.out sudo make install 2>&1 | tee make-install.out - The BDB driver now requires BDB v4.5.20, which can be downloaded from: http://www.oracle.com/technology/software/products/berkeley-db/index.html IMPORTANT: this release of the driver also requires all of the patches in examples/berkeley_db/patches-to-berkeley-db/db-4.5.20 There is a script to apply these in the correct order -- see below. Most of the patches are 'official' patches provided by Sleepycat/Oracle, and will be incorporated into later public releases of BDB. Patches should be applies in the root of the unpacked BDB tree, before configuring and building BDB or EDTK/berkeley. The patches should all apply cleanly; i.e. no 'hunk offset' warnings from the patch utility. e.g. (adjust paths as necessary) cd /home/$USER/erlang/downloads/edtk tar -zxvf db-4.5.20.tar.gz cd db-4.5.20 ~/erlang/downloads/edtk/edtk-1.5/examples/berkeley_db/patches-to-berkeley-db/db-4.5.20/apply-patches.sh cd build_unix ../dist/configure --enable-debug --prefix=/home/$USER/erlang/downloads/edtk/BerkeleyDB.4.5 make make install **IMPORTANT** Your operating system very likely ships with an older version of Berkeley DB. Mixing different 'default' installations of of Berkeley DB on the same host can be very confusing, as the contents of change between versions, and that header may be compiled into existint programs. Therefore I use the --prefix argument to 'configure' to pick a custom installation directory. (The BDB library will be packaged as part of the berkeley_db application anyway.) If you do this but pick a different directory then you may need to edit berkeley_db/Makefile to set BDB_INSTALL_DIR to the directory that you chose. Then build EDTK and the berkeley_db driver, and run the tests: cd /home/$USER/erlang/downloads/edtk/edtk-1.5/examples/berkeley_db pushd ../../edtk; make clean; make; popd; make clean; make rm regression.out; make regression 2>&1 | tee regression.out **IMPORTANT MAINTENANCE ISSUE** Future releases of BDB require that constants in berkeley_db.xml be adjusted. Diff the generated (installed) db.h file from BDB 4.5 with the db.h file from the target BDB release to see what must be changed. Architecture Overview First, please read Scott's original README file which has pointers to the original EDTK documentation, including his excellent paper. The following overview begins with a brief, very high-level ('manager level') summary of how Erlang applications are structured, and how we want them to interface to BDB. It repeats/summarizes a little of the overview in Scott's paper. The later part of the overview explains the new 'helper layers' of the BDB driver that were not present in EDTK v1.1. Everyone using the BDB driver should read that part. Quick Introduction: The Erlang VM runs its own scheduler for Erlang processes. An Erlang process is similar to a user-level thread but it is totally isolated -- it can only communicate with other processes via message-passing; i.e. no shared memory or mutexes. An Erlang application typically consists of hundreds (perhaps thousands or even millions) of these lightweight processes. The idea is to have one process for every concurrent thing in the domain of your application (the 'active object' model). The Erlang VM uses either a single OS thread, or a single OS thread per cpu, to run schedulers for Erlang processes. Erlang processes are scheduled preemptively. They can 'crash' without disturbing other processes (due to no use of mutexes/shared memory). Processes can also monitor each other and get messates or asynchronous signals if another process crashes or exits. Erlang interfaces to the outside world via 'port' objects which are simply message-channels. To use an external library you have to write code to make the library look like Erlang processes -- it can only communicate by sending and receiving messages, and must not block the Erlang VM process/thread. Once a library has such a message-passing interface, it can be used by 'linking it in' to the Erlang VM as a shared object, or spawning another program (another OS process) and communicating with it over a pipe (over the spawned process' stdin and stdout). However, it's not quite as simple as just creating a message-passing interface. Another important part of Erlang is the high degree of isolation between Erlang processes -- e.g. lack of locking/mutexes. But some libraries expose APIs that may block for a long time -- even indefinitely. e.g. Berkeley DB implements a page-oriented database, and includes a locking subsytem to control access to those pages. Therefore, if multiple Erlang processes use BDB concurrently (which we definitely want), those Erlang processes can acquire locks, and block against locks held by other Erlang processes. This applies to a Berkeley DB application written in any language (it's not a property of EDTK or the driver.) Also, Berkeley DB obviously does a lot of disk IO, but it does not support asynchronous IO because it is a highly portable library and there is no standard async IO API cross operating systems. For the same reason, Berkeley DB does not even create any OS threads itself (sidebar: one exception to the latter rule is the new 'replication manager' convenience layer). Therefore, Berkely DB will often block the application's own OS threads -- when blocking against a logical lock, or when doing IO (the latter applies to many kinds of libraries of course, and the new features in EDTK will help when writing drivers for such libraries). Of course it is critical that the Erlang VM's scheduler threads are never blocked by BDB. The standard, supported way to achieve this is using the Erlang 'async thread pool' ('erl +A' and the driver_async() API). Unfortunately that mechanism is not flexible enough for the Berkeley DB driver (and using it would risk interfering with other important drivers like efile_drv). So EDTK now implements private threadpools, and multiplexes commands across those pools. Each driver instance (port) has it's own set of threads, and the pools are resizeable at runtime. (This support only requires the standard Erlang VM and APIs -- it does not require any modifications to the Erlang VM.) The BDB driver currently uses 4 threadpools; see comments at the top of berkeley_db.xml for details. Note that when an Erlang processes 'blocks' against Berkeley DB, it is actually sitting in a receive statement (with a timeout), waiting for a reply from the driver. So to an application, blocking against a BDB lock looks much the same as attempting to access a file on disk, or a connetion to a client/server database like MySql or PostgreSQL, or simply waiting for a reply from a gen_server:call. But to get decent performance (avoid unnecessary blocking), applications using BDB do need to consider access patterns and lock contention. Note that there is one difference between waiting for the result of a BDB command and waiting for the result of a normal Erlang gen_server:call -- BDB commands are processed in threadpools of finite size, which therefore may become backlogged or saturated. There are various mechanisms in place to handle that; e.g. the threadpools can be resized at runtime (while in use), and applications can query the current length of the queues (to decide whether to add more threaes). See also the TODO file for details of some subtle issues that can arise with threadpool saturation.) The role of EDTK: Much of the code in port drivers is tedious boilerplate. EDTK is a code-generation tool that takes a declaration of the target library's API (e.g. see examples/berkeley_db/berkeley_db.xml) and some template files containing boilerplate, and produces the Erlang and C code that implements the message-passing interface for the library. It generates the following code: Erlang code - Provides Erlang 'stub' functions that map to library APIs. - The stub functions marshal arguments into a message. (A new EDTK feature is that we now also pass a a tag that contains a unique identifier that can be used to associate a reply with its request, and the sender of the request.) - Send message to the port - Wait for reply. (It is now possible to generate non-blocking variants of APIs -- these return the request 'tag', which can be used to collect the reply later.) - Unmarshal the reply message and return it to the caller. - The previous release of EDTK would always return results to the caller as {ok, Value} or {error, Value}. This release gives the option (on a per-driver bases) of throwing errors as exceptions by default (and returning just the Result, rather than {ok, Result}, in the success case). See this paper for the rationale of using throw vs {error, Reason}: www.erlang.se/workshop/2004/exception.pdf www.erlang.se/euc/04/carlsson_slides.pdf Note: the BDB down now does this (and this is a significant API change from the berkeley_db driver in EDTK v1.1). See usage examples in: examples/berkeley_db/berkeley_db_test.erl C code - Unmarshals command arguments from the message - Create 'command' objects and enqueue them on the producer/consumer FIFO queue for the appropriate threadpool. - Implement the threadpools -- i.e. consume items from the queues, call the BDB API with the correct arguments, and enqueue the response in an internal queue for the Erlang VM process/thread to collect (and then tell the Erlang VM that the internal queue contains something, by writing to a pipe fd). - Marshals the reply into a message - Sends the reply back to the Erlang VM. The message is sent directly to the process that sent the original request. This is a significant difference between this release of EDTK and previous releases, because it enables multiple Erlang processes to share a single port instance (previous releases of EDTK required that each Erlang process opened its own port instance). However, this feature greatly complicates the implementation details of EDTK, because of the need to do proper cleanup if an Erlang process crashes while owning BDB resources like transactions and cursors. In previous releases that process' port instance would have been closed automaticall, and the driver's C stop() function would have done the cleanup. But in this release the single shared port instance cannot be closed just because one of the processes using it happened to crash. See next point (and - Tracks resource usage by the Erlang application; these are stored in arrays called 'valmaps'. e.g. The BDB driver tables of open DB_ENV, DB, DBC, and DB_TXN handles. If the port driver is closed unexpectedly (e.g. the Erlang process that owns it crashes) then the shutdown C code closes cursors, aborts transactions, closes database handles, closes the environment handle, all in the correct order (even if some are currently in use when the shutdown begins);. Important: In this release, resources (valmap table entries) are tagged as belonging to a specific Erlang process -- any process that wants to use BDB. A command is provided to cleanup just the subset of resources owned by a specific Erlang process (should it crash). But note that this is entirely hidden from application code in normal use (i.e. you don't have to worry about it). See the description of the new 'port coordinator' server later. - Checks that constraints are met; e.g. transactions can only be used by one command at a time, a prepared transaction cannot be automatically aborted if a process crashes (it must be resolved by a GTM), etc. etc. Basically it implements most (but not all) of the pre-conditions/constraints described in the BDB documentation. - Converts asyncyronous events from Berkeley DB (the new DB_ENV->set_event_notify mechanism in BDB v4.5) to an Erlang message and sends them to Erlang (to the port-coordinator mentioned later). Plus about a zillion other details. Using The BDB Driver: The BDB driver can be packaged as an Erlang/OTP 'application', using the (very basic) script: examples/berkeley_db/releases/make_release.sh The resulting directory tree is an OTP 'library application' (including a minimal .app file), like STDLIB. i.e. It is just a collection of code modules; it doesn't have start/2 function for use by application:start(). A Erlang system that uses Berkeley DB will consists of one or more Erlang nodes, each of which loads the berkeley_db application created with the above script, and then opens one (or rarely, more) BDB environments via one of the following two methods: - If replication is not being used then the application starts an instance of berkeley_db_port_coordinator per BDB environment. - If replication is being used, the application starts an instance of of berkeley_db_replication_group (which in turn starts a port-coordinator process, and supervises it), per BDB environment. Both of the above processes are standard gen_servers. Importantly, they both have start_link functions, and can participate in standard OTP supervisor trees. **API note*** In general, all of the BDB driver functions conform to 'modern' Erlang thinking and return a plain Result or throw an exception on error. i.e. They don't return {ok, Result} or {error, Reason}. However there are exceptions to that rule -- e.g. all start_link functions do still return {ok, Pid} or {error, Reason} as OTP supervisors require that. Please read the API documentation at the top of the respective .erl files. What follows here is just a summary: berkeley_db_port_coordinator.erl: - A port_coordinator owns the single port instance that represents and communicates with a single BDB environment on the local machine. A single Erlang VM can run as many port_coordinators (and therefore as many different BDB environments) as it likes. Most applications will only need to use one. IMPORTANT: It is *CRITICAL* that only a single port-coordinator be configured to talk to a given BDB environment at any given time. If two port-coordinators open the same BDB environment then irrevocable data-corruption is almost certain. This may be made safe by use of BDB's DB_REGISTER flag (see BDB documentation). The port_coordinator does the following: - Opens the port driver for an environment (which creates a single DB_ENV handle for that environment). - Provide a very flexible & homogeneous configuration facility for almost all BDB config APIs and parameters, including sensible defaults. (This is quite an important convenience, as Berkeley DB has a *lot* of configuration parameters, scattered across a wide number of APIs. It makes it possible to configure an entire BDB environment by passing a single Erlang term (lists of tagged tuples) to the port-coordiantor at startup.) - Calls DB_ENV->open() to open the BDB environment. Typically it also runs Berkeley DB recovery, to ensure that the database is in a consistent state after any earlier crash. BDB recovery must be coordinated (e.g. single-threaded), so the port-coordinator does this synchronously, before it returns from init(). - If distributed transactions are being used, then it runs DB_ENV->txn_recover and if necessary contacts a GlobalTransactionManager to resolve any unresolved txns that it finds during recovery. [Important: The BDB API makes this a blocking operation -- the port-coordinator cannot continue unless the GTM replies. Hence to avoid serious availability issues, the GTM should be replicated. Work is in progress on that.] The port-coordinator also does 'incremental recovery' of unresolved transactions while the application is running (e.g. if a process calls txn_prepare on it's local part of a distributed transaction, but then crashes before it commits or aborts the transaction, then the port-coordinator takes ownership of that unresolve transaction, and asks the GTM for a decision whether to abort or commit it). Important: distributed transactions should not be used without knowing what you are doing (e.g. all such transactions must use lock or txn timeouts to avoid distributed deadlock). So caveat emptor. - If replication has been enabled then it call DB_ENV->repmgr_start, and repmgr_add_remote_site etc, to join a replication group. (There is a lot more to replication than this; for instance, the election of a 'master' site (only the master can do write operations). Before using replication, read the BDB replication documentation. - Allow client processes to 'register' to use the BDB environment This means that the port_coordinator tracks the life of the client process, and will cleanup after that process if it crashes -- close any open cursors, abort open transactions, etc. - Hand out shared DB handles to clients on demand (one handle per database, shared by all clients). The port_coordinator creates the database on first use (i.e. specifies DB_CREATE to db_open) - Receives all BDB event notifications via the mechanism described below, and publishes them to an arbitrary set of interested (registered) processes via a gen_event handler. This set of registered processes can be entirely independent from the set of registered clients that have asked to use the environment. - Recieves custom events from the C code in the driver. e.g. If an attempt is made to automatically 'clean up' a transaction that has been prepared but not yet explicitly commited or aborted (e.g. one process participating in the distributed transaction crashed), we cannot unilaterally abort that txn because other sites might already have been told to commit. Instead of aborting the txn, the C code sends an event (and a reference to the txn handle) to the port_coordinator, which asks the GTM to resolve the txn -- and the port_coordinator then commits or aborts the local txn. berkeley_db_helpers.erl: This contains convenience functions for using BDB -- e.g. a do_txn() function that takes an entire transaction (packaged as a {Module, Function, Args} tuple), and executes it against a given BDB environment (i.e. it takes a port-coordinator Pid or name). berkeley_db_sequence_server.erl This is a (useful) example of a BDB application. It is a gen_server that implements a classic database 'sequence'. That is, a series of (in this case 64-bit) integers that do not repeat, even if the server crashes (i.e. consumption of integers is persistent). The integers are suitable for as message-ids, primary keys, etc. Note that the sequence is NOT guaranteed to be purely sequential -- there can be gaps (possibly very large gaps), and higher integers may be returned before lower integers (although this is rare). The only guarantee is that the same integer will not be returned twice. Obtaining an integer is typically very fast as the server holds a 'cached range' of integers, and only needs to do a BDB transaction when that range is exhausted. The size of the range can be set when the server is started. Specifying a large range reduces the frequency of transactions, but results in larger gaps in the sequence (irrevocably wasted integers) when the server is restarted (either normally or after a crash), as each restart must begin consuming from the *next* cached range (as it is not know how many integers from the last range were actually consumed). However, with a 64-bit total range there is no lack of integers -- ranges of size 1,000 or more are typically fine. Note that a recent version of Berkeley DB added an almost identical feature (see the documentation for DB_SEQUENCE). However, the BDB driver does not implement those BDB APIs because all calls to BDB must incur overhead in marshalling/unmarshalling, enqueuing to threadpools etc, and most calls to sequence_server only increment and return an integer. i.e. It's faster to keep the cached-range on the Erlang side of the port boundary, and do explicit transactions when a new range needs to be started. In the near future this application will be changed (or a variant created) that supports replication, so that if a host/disk dies permanently, the current value of the sequence is not lost. berkeley_db_global_transaction_server.erl: This is part of the support for distributed transactions. It provides (on demand) 128-byte GUID required for distributed transactions (see the BDB txn_prepare API). It also acts as an authoratative repository for the state of all distributed transactions (i.e. whether they have committed or aborted), which is queried if a partitipant in a distributed transaction fails after preparing it's local part of the transaction (when the partitipant recovers it asks the GTM for the state of the transaction, in order to commit or abort it locally). In the near future this application will be changed (or a variant created) that supports replication. This is vital for reliable use of distributed transactions, as applications can block permanently if the GTM is not available or loses data. berkeley_db_msg_queue.erl: This is a sketch of an example use-case for the BDB driver. It implements an asynchronous transactional message queue with guaranteed exactly-once delivery. It is designed to given similar reliablity guarantees as distributed transctions but avoid the availability issues of 2-phase-commit. In particular, the destination may be unreachable but senders can still transactionally enqueue messages. The idea is that some 'sender worker' process (bit of business logic) performs a BDB transaction do some work, and wants to notify another process (presumably on another node) that the work has been done (and perhaps pass some arbitrary payload with the message). So the sender uses functions in this module to enqueue a message in a Berkeley DB queue database as part of the normal work transaction. That's all the 'sender' has to do -- i.e. there is no 'send message' API, just an enqueue API. A separate 'send pump' process transactionally consumes messages from the local message queue, sends them to a destination 'receive pump' process, waits for an ack, and then commits the consume transaction. (If a consume transaction is aborted the message automatically reappears in the queue, ready for a retry.) The 'receive pump' process (on another host/BDB environment, otherwise there is no point) receives messages, transactional enqueues and commits them to a local queue database, and then replies with an ack to the 'send pump'. (Note that if this ack is lost the 'send pump' will abort it's consume transaction and try to send the message again, so to guarantee exactly-once delivery all messages have a guaranteed-unique id (an instance of sequence_server is used for this) and the receive pump records the ids of all messages that it has commited to its local queue -- and simply discards and acks (again) any resends of messages that it has already accepted.) Now a 'receiver worker' process (another piece of business logic) on the same host as the 'receive pump' process can transactional consume messages from the receive queue, process them, and if the work transaction succeeds, commit the consume transaction. One implementation detail: Berkeley DB queue databases have fixed-size records, but we want to support arbitrary payloads. So if a message won't fit into the configured maximum queue record length, part of it is stored in a btree database. The two databases are manipulated in the same (nested) transactions, so they will always remain consistent. This code has not yet been 'productized' or even fully tested. In particular it needs a better routing layer (currently it uses 'global' process registration of the receive pump process, as an experiment -- see later for why this is not a good idea) So treat it as an educational example -- i.e. caveat emptor. Also, this subsystem is intended to run with replication, so no messages can ever be lost even if a host or disk fails. But the implementation does not yet support that. There is a usage-example/test of this code in berkeley_db_test.erl. berkeley_db_replication_group.erl: The replication_group processes are responsible for tracking the state of all sites in a replication group, including the location of the master. A replication_group server does the following (see source file for more details): - Spawns and supervises a local port_coordinator and registers as an subscriber to it's event channel. - Attempts to find all other sites in the replication group (and retries if they are down). These other sites might be on other Erlang VMs (on this host or other hosts). - Optionally it can create the other sites if they are down (at startup or if they crash for any reason). So an entire replication-group across multiple nodes/hosts can be started by a single API call. Note that the sites will all attempt to 'peer supervise' each other -- i.e. restart each other if they crash. This is an experimental feature. There is nothing wrong with starting each site in a replication_group via a local supervision tree on each node (infact that is the recommended approach). - Listens for and tracks status updates from all sites - Implements a simple state machine that converts sleepycat events from all sites to absolute states. - (Important convenience). All sites are willing to accept 'do_txn_on_master' calls, and (if the local site is not the master) will transparently forwards user transaction functions to the current master site. For this API, transactions are 'packaged' as {Module, Function, Arg} tuples. Note that funs are NOT used due to various issues with code upgrade (we want to be able to upgrade code across a replication group on a per-site basis, and funs are bound to the hash of their module; {Module, Function, Args} tuples do not have that problem. - All sites are willing to acdept 'do_txn_on_any_site' calls (i.e. read-operations). If the local site is 'in-sync' with the master (or if no master is running) then the local site will perform the transaction. Otherwise the transaction will be sent to a site that is in-sync. - Handles adding/removing sites to the group. (This is not yet fully implemented -- it requires BDB features which will hopefully be released in Berkely DB v4.6 in 2007). - Coordinates clean shutdown of all sites in the group (when the application shuts down). This involves suspending writes and waiting for all sites to fall into sync. (This is not yet implemented -- again in needs features in Berkeley DB v4.6.) Approach To Distributed System Communication When configuring a replication_server, each site needs to know about all of the others in the group. Sites exchange replication data (handled by BDB internals), and coordination messages and the Erlang level (including tracking which site is master). Note that Berkeley DB's replication layer makes its own connections -- replication traffic does *not* pass through Erlang processes. The BDB APIs for this require TCP/IP addresses, of the form {HostName, Port} tuples. But the Erlang-level coordination (tracking of master, forwarding 'transaction functions' to be executed on the master and then returning the results) use Distributed Erlang. However, the code restricts itself to a 'safe subset' of Distributed Erlang features. In particular, BDB replication is designed to work correctly in even if the network is partitioned temporarily. Some Distributed Erlang features do not work/recover well in the presence of network partitioning. The code does use the following features of Distributed Erlang: - the basic connection setup/heartbeat/teardown/retry stuff (named nodes, epmd, net_kernel etd) - 'remote pids', including link/monitor of pids on remote nodes In particular, the code makes frequent use of *local* process registration and the {ProcName, NodeName} form of addressing remote-processes (e.g. to gen_server:call() et al). We also use remote-pids as 'cached' forms of these addresses. The code does *not* use the following features of Distributed Erlang as they are rumoured or known to have problems under network partitioning.: - dist_ac - global - pg, pg2 - mnesia Another potential issue with using Distributed Erlang is that it only uses direct routing, and by default it attempts to maintain a fully-connected network (N^2 connections). This imposes a ultimate limit to scalability. net_kernel can be set to do lazy on-demand connections rather than transitive proactive connections, but unless application communication patterns are constrained, it will eventually create a fully-connected network. The largest Distributed Erlang system I've heard of (from a mailing list post) is 80 nodes. It seems quite feasible to have low-hundreds, but possibly not thousands until the epoll patch is made official. The replication_server code is designed to run with relatively small groups (3 to maybe 20 or 30 nodes). It has only been tested with up to 5 nodes. One option is to use Distributed Erlang within each replication group/cluster, and some other protocol between clusters. berkeley_db_replication_group_helpers.erl The fact that replication groups don't use any kind of global process/site registration service invites the question, how does application code find a site in order to run a transaction? This is a particularly relevant question given that the whole point of replication groups is to be highly available even if some percentage of their sites are down or unreachable. First note that some applications don't have this problem because they use only a single replication_group, which has a site on every node in the system. Therefore the application just attempts to use the site on it's local node, as if the application is running then it's highly likely that the replication site will be running (or will be soon, if it is running recovery). The local site will forward any write transactions to the master, and will perform any read transactions locally, so long as the local site is 'in sync' with the master (processing live updates, not doing bulk-recovery, aka 'bootstrapping'). But if a system does not have a replication site on every node, (e.g. if the system is a set of replication-clusters), then we need to be able to find and use 'remote' replication sites. It is assumed that the entire user application (i.e. every node) *does* know the names and nodes of all the replication sites (this information is required to start each replication_group server, so it must be globally available somehow). So all we really need to access 'remote' replication sites are some helper functions: - to remember the latest 'master location hint' (to optimize routing of write transactions) - to load-balance read transactions across all sites that are believed to be reachable - to remember which, if any sites, we have found to be unreachable (so we don't try to send requests to them for a while) - to periodically try to contact any unreachable sites, to see if they have recovered yet As replication groups normally exist in a steady state without failures (most/all sites running, and a stable master), we simply need to discover that state, remember what it is (as an optimization), and re-discover it fairly quickly after a change (e.g. if the master becomes unreachable we need to stop trying to connect to that site for a while, and also find the location of the new master). The functions in this module accomplish this. They manipulate an opaque object called a caller_state (record). The definition of this record is available in a header file incase it is useful, but most of the time it should be possible to treat it as an opaque token. The idea is that each node stores the caller_state for each replication group somewhere -- an ETS table is perfect. Then when a process wants to do a transaction on a replication group, it simply looks up the caller_state for that group, and passes it to one of the helper functions in this module (do_txn_on_master or do_txn_on_any_site), along with the {Module, Function, Args} tuple that it wants to execute. In the candidate 1.5 releases this module had not been fully tested. It has now been tested, and the API should be stable. Also, support has been added for distributed transactions against the masters of separate replication groups (see do_distributed_txn_unordered). List of Changes since EDTK v1.1 (might not be totally complete -- there were a lot of changes). Please see doc/EDTK_BerkeleyDB.ppt for some design rationale and explanation. The changes can be categorizes as follows: - Enhancements to EDTK - Enhancements to BDB driver - Bug fixes to EDTK - Bug fixes to BDB driver Enhancements to EDTK framework (not driver-specific) - The previous release of EDTK used the Erlang VM's 'async IO threadpool' for Berkeley DB operations, but that has several critical problems. - Long-running BDB operations would block IO by other important drivers (like efile_drv which provides access to the file-system). e.g. Some BDB operations may take a very long time (e.g. txn_checkpoint, memp_trickle, db_compact), or may block a thread indefinitely (e.g. db_get with the DB_CONSUME_WAIT option). - The native Erlang VM's threadpool schedules jobs on a round-robin bases, or by using a hash of an application-specified key. The round robin mechanism would violated BDB preconditions on thread use (if it did not deadlock first), and the hash mechanism would require a dedicated thread for each *possible* concurent BDB transaction (so hundreds of threads, most of which would be idle), and would still not be safe from deadlock (as some BDB operations use private internal transactions, so the application does not have a 'key' to hash to a thread id). - The native Erlang threadpool is sized at runtime (with 'erl +A'), but we want more flexibility than that. This is fixed in this release by allowing each port instance to create multiple private worker thread pools, dedicated to specific classes of operations. Each threadpool is fed by a producer-consumer queue (unlike the native Erlang VM threadpool in which each thread has its own queue, so some threads may be busy and have work stuck in their queues while other threads sit idle). The threadpools may be resized at runtime (threads added/removed), the stack size for threads may be specified (to reduce virtual memory usage with large numbers of threads -- Berkeley DB only needs about 32KB -- 64KB stack), and the application can place limits on the lengths of the threadpool queues, to reject new commands under saturation. (The length limits can be changed at runtime.) - Added support for shared-access valmaps. Some libraries may produce resources (valmaps) that are thread-safe, and can safely be used by multiple Erlang processes at once. (e.g. Berkeley DB DB_ENV and DB handles can be made thread-safe). So a valmap type may now be declared as 'shared', and appropriate reference-counting will be done to manage automatic cleanup. Also, some operations on shared valmaps may need exlcusive access (e.g. any 'stop' operation like txn commit/abort, or cursor close). Such operations do test-for and obtain exclusive-acces. This feature is an important enabler for the BDB driver, as DB_ENV and DB handles are documented as being slow to open (and expensive in terms of memory and other resources). So for large numbers of Erlang processes to use BDB together, it is essential to share these handles. Also many BDB features (e.g. replication) are more difficult to use if more than a single DB_ENV handle is open on an environment at a time. - Added support for parent/child relationships between valmaps slots (*of the same type*). The relationship force correct ordering of cleanup operations. i.e. If a valmap with any children is 'stopped' then all of its children (recursively) are automatically 'stopped' first. This feature was necessary to support BDB nested transactions correctly. - Added EDTK support for converting numeric library parameters and return codes to/from atoms. Berkeley DB has a _lot_ of public #define constants of the form DB_xxx which are used as flags, enums, and/or return codes. These constants can and frequent do change numeric value between releases of BDB. That's fine for a C application (it just needs to recompile to pick up the new constants), but a distributed Erlang application must deal with different sites running different versions of BDB (during an site-by-site upgrade), and it is critical that the meaning of the parameter/return code values not be mis-interpretted. So now the Erlang stubs accept atoms or lists of atoms for these parameters, and convert them to the correct numeric constants at the last possible moment before calling the C code. - Added EDTK support for pattern matching on selected return values from the C driver, and taking arbitrary actions (e.g. throwing, exiting, transforming the result before returning it). - Added EDTK support for throwing all {error, Reason} return values as Erlang exceptions by default, and pattern-matching selected return values to be returned as {error, Reason} tuples. The Berkeley DB API now uses this form of interface, which makes application code *much* more convenient. The previous EDTK release predates system probably did not have this feature due to the fact that Ericsson only recently added greatly improved exception handling to Erlang (try/catch/after, rather than the old catch construct). - Added EDTK support for returning large complex structures from the driver using the 'Erlang external term' format (the 'ei' and 'erl_interface' C libraries provided by Ericsson). This is used in supporting the BDB 'statistics' APIs. - Improved logging when debugging EDTK and/or driver code. e.g. Each port instance may now have a textual 'label' per port-instance, to distiguish log messages from different instances running on the same node. (The BDB port-coordinator sets this label to the registered name of the port-coordiantor process.) Enhancements to BDB driver - EDTK v1.1 did not support some essential parts of the Berkeley DB API, largely due to safety/correctness issues arising from the lack of flexible threadpools. Support for the following has been added: - transactions, including nested transactions - distributed transactions, and recovery (including a GlobalTransactionManager) - important 'housekeeping' functions such as txn_checkpoint, log_archive, memp_trickle etc. - all 'statistics' APIs : these return large structures full of useful internal data - replication : a large, complex topic on its own - db_compact : 'defragmentation' of live btree databases - db_get(DB_CONSUME_WAIT) : blocking consume-from-DB_QUEUE-database See examples/berkeley_db/berkeley_db_api_support_status.txt for a summary by API. - One of Erlang's main strengths is that it is reasonable/normal to have thousands of Erlang processes running on a single node. And of course we want a lot of Erlang processes to be able to simultaneously use the C libraries that are wrapped by EDTK. The previous release of EDTK had an implicit (but totally reasonable) assumption; that.concurrent access to a C library from Erlang would be achieved by opening a separate port (a separate instance of the driver/library) from each Erlang process that wants to use that library. It's hard to argue with that model -- its clean and simple, and most importantly, the Erlang VM guarantees that all ports opened by a process will be automatically closed if that process exits. So (with appropriate C code in the driver's stop() function), cleanup is automatic. Unfortunately that model is not a good fit for Berkeley DB, for important practical reasons. - BDB requires coordinated (single-threaded) startup and shutdown. Also there are houskeeping jobs like txn_checkpoint, log_archive etc. that should be run per environment (and must be coordinated). - Opening BDB environment and database handles is slow. Fortunately these handles can support shared-access (they are thread-safe), and the BDB docs strongly advise applications to share them to achieve good performance. But with transient worker processes would be opening and closing DbEnv and Db handles per transaction -- i.e. performance would be dire - Even if opening a separate DB_ENV handle per port was fast, many BDB features work best with only a single DB_ENV handle (e.g. replication). - The Erlang VM places a hard limit on the total number of ports (of any kind) that can be opened simultaneously. - We probably want to limit the number of processes using BDB (to avoid thrash), but in the one-process-one-port model there is no way to limit the degree of concurrent access (number of open port instances) other than running into the VM's hard-limit (which would starve the system of other kinds of ports). All of those problems are fixed in this release allowing multiple Erlang processes to share one instance of a port. The sharing of a port is accomplished by a few changes - A new module called 'berkeley_db_port_coordinator.erl' that coordinates startup, shutdown, sharing of DB handles, incremental resolution of distributed transactions, and other things. - A small change to linked-in mode to use driver_caller and driver_send_term instead of driver_output_term, to allow the reply to be sent to the process that sent the command, rather than always to the ports connected process - All commands to an EDTK-generated driver now carry a unique tag (currently term_to_binary({Ref,SendPid}) to precisely associate replies with their commands. In pipe mode (when all replies arrive at the port's 'connected process' -- i.e. the port coordinator), we have enough information to forwarded the reply to the process that sent the command. The tag mechanism also allows asynchronous (non-blocking) variants of commands to be generated. i.e. It is possible to generate a function to send the command (and return the generated tag), and another function to do a blocking receive for the reply to that command (the function requires a tag of course). This can be useful in some circumstances -- e.g. servers that want to use Berkeley DB but don't want to spawn worker processes for each operation. The port-coordiantor uses this mechanism for some internal operations (e.g. during shutdown). Currently only a few BDB API functions have non-blocking variants -- simply to avoid code-bloat (most applications won't need them). - dbenv_close and dbenv_remove now flagged as global-exclusive operations. These are not allowed if any valmaps are in-use at all. Bug fixes to EDTK - EDTK's 'valmap' system for managing handles had a known, dangerous weakness; Erlang applications may accidentally use expired valmap entries (i.e. references to transactions that have been commited/aborted), which will crash the Erlang VM. This has been fixed with per-slot generation-ids. Each valmap slot has a 32-bit counter. Eadh 'valmap record' returned to Erlant contains the value of that counter. Any valmap 'stop' ('free') function increments the value of the counter. So if valmap record is used after it is 'stopped' (e.g. a db_txn handle is used after it is committed or aborted), the counter value shows that the valmap record has expired, and an error is thrown. - The value returned by the wrapped library was leaked if no valmap slots are free after the call. The solution was to reserve the slot when the command is received. - Driver shutdown didn't acquire a critical mutex, but operations may still be in-flight (race condition/undefined behavior) - Driver shutdown didn't wait for in-progress operations involving first valmap type to finish before it starts closing valmaps of the second type (and so on) - Driver shutdown did not call cleanup_.._index() for any valmap entries that are INUSE, so the code in cleanup_.._index() that sets DELAYED_CLEANUP is never activated, so some valmaps (that are in-use when stop is called) are never cleaned up. (DELAYED_CLEANUP would self-deadlock if it was ever used, as cleanup_index() was called while desc->mutex was held, and cleanup_index tried to acquire the (non-recursive) mutex again). - Driver shutdown could consume infinite stack if it was not safe to return (valmaps were still in use) - Cleanup_index held mutex while calling the valmap cleanup function (e.g. txn_abort, for BDB) - which may take a long time. - Several bugs in 'pipe mode' prevented Berkeley DB from being shut down cleanly. - The DELAYED_CLEANUP feture would self-deadlock if used, as it re-acquired a non-recursive mutex. - If a successful valmap 'start' (allocation) operation is performed after driver shutdown has begun then the object would be leaked (the async_free call is currently just sys_free (to free the callstate object), so nothing currently calls the valmap cleanup_func on the result of the library call made by invoke -- i.e. an object returned from the library (bdb) is leaked) Bug fixes to BDB driver - The EDTK v1.1 examples/berkeley_db driver installed allocation functions that called driver_alloc_binary in async workers (from within BDB code). This resulting in memory corruption and segfaults under concurrent load. Amusingly, this practice actually became legal in Erlang R11B but only when SMP support is enabled. I have a TODO to add this back into the driver (it is a potentially important optimization, as it avoids otherwise redundant allocations and copying of data). See this thread on the Erlang mailing list for related details: http://www.erlang.org/pipermail/erlang-questions/2006-October/023500.html - db_rename, db_remove, db_upgrade were not tagged as valmap "stop" operations, so the Db handle was left in the valmap array, and would crash during _stop(). - Removed support for env_set_errpfx as it made BDB refer to memory that had already been released (undefined behaviour) - Data returned by db_get, c_get etc would be leaked if driver shutdown was started before they completed. - To avoid potential deadlocks of the Erlang VM, BDB's auto-deadlock detection feature is now explicitly enabled during driver shutdown, as the application may have been calling the explicit deadlock detector frequently, but it can no longer do so once shutdown has begun, and operations might still be in progress that could deadlock Support I intend to support and enhance this library as time permits. Please send questions, suggestions, bug reports, and patches to chris.newcombe@gmail.com End of file.