CCE Indexing - The need for speed

Background

One of the challenges of CCE is the lack of indexes for searching. For example, if you have 10,000 email addresses and want to add another one, you must first check that the proposed new email address does not already exist. Linear behaviour when searching across X records is not exactly conducive to a responsive system.

So we need to add indexing to help improve performance. The Cobalt authors intended to add BDB indexes for other keys in CODB, but this was never implemented. This document is an attempt to describe a possible future architecture for adding indexing to CODB.

Current Indexing

Class Indexing

Before talking about what we might do to improve the performance of searches that require matching on object attributes, it is worth reviewing the indexing that already exists for simple class searches. If we execute a find <classname> in cce, an existing Berkeley DB index is used. This index is a BDB file located at /usr/sausalito/codb/db.classes.

The contents of this index are pretty simple. Searching the database for classname X returns the list of OIDs that match that classname. When we create a new object, we add its OID Y to the existing OID list for classname X in the index. Conversely, when we delete an object, we load the OID list for the class in question and remove the deleted object's OID from the index.
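The maintenance logic described above can be sketched as a toy in-memory model (the real index is a BDB file on disk; the class name `ClassIndex` and its methods are hypothetical, invented here purely for illustration):

```python
class ClassIndex:
    """Toy in-memory model of db.classes: classname -> set of OIDs."""

    def __init__(self):
        self._index = {}

    def find(self, classname):
        # 'find <classname>' returns the OIDs recorded for that class
        return sorted(self._index.get(classname, set()))

    def add(self, classname, oid):
        # on object creation, add the new OID to the class's entry
        self._index.setdefault(classname, set()).add(oid)

    def remove(self, classname, oid):
        # on object deletion, load the class's OID list and drop the OID
        self._index.get(classname, set()).discard(oid)


idx = ClassIndex()
idx.add("User", 101)
idx.add("User", 102)
idx.remove("User", 101)
print(idx.find("User"))  # [102]
```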

Free OID Indexing

In addition to the above index, there is a simple text list describing the OID numbers currently in use. This file lives in /usr/sausalito/bin/codb.oids.

This index is a plain text file containing a comma-separated list of the OIDs that are in use; it can also contain ranges of utilised OIDs.

When an object is created or deleted, the appropriate OID number is added to or removed from this file.
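A minimal sketch of parsing and regenerating such a range-compressed list (the exact on-disk syntax is an assumption here; ranges are written as `lo-hi`, matching the description above):

```python
def parse_oid_list(text):
    """Parse a comma-separated OID list that may contain ranges, e.g. '1-3,7'."""
    oids = set()
    for part in text.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            oids.update(range(int(lo), int(hi) + 1))
        else:
            oids.add(int(part))
    return oids


def format_oid_list(oids):
    """Serialise a set of OIDs back into the compact comma/range form."""
    out, run = [], []
    for oid in sorted(oids):
        if run and oid == run[-1] + 1:
            run.append(oid)
        else:
            if run:
                out.append(str(run[0]) if len(run) == 1 else f"{run[0]}-{run[-1]}")
            run = [oid]
    if run:
        out.append(str(run[0]) if len(run) == 1 else f"{run[0]}-{run[-1]}")
    return ",".join(out)


used = parse_oid_list("1-3,7,9-10")
used.add(8)       # object created
used.discard(2)   # object deleted
print(format_oid_list(used))  # 1,3,7-10
```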

One observation: theoretically, we could get the list of used OIDs from the above class index instead of from the text file. Why did the Cobalt developers decide to use a text file instead of the existing OID list? We already have the list of known “classes” that are defined, based on the in-memory information loaded from the schema files during boot up, so a find on all of those classes would yield an array of allocated OIDs. My assumption is that the current structure lets object creation find the first free OID with a single read of a single file (the first OID number after the first comma in the text file). This has the benefit of being faster for object creation, but it requires an extra index that can drift out of sync with either the real object directories or the class index.

Locking

Updates to the above class index and used OID list will of course require protection through locking. I have not yet reviewed the code to understand how the current locking mechanisms work.

Data Integrity

As mentioned above, there have been numerous occasions where the used OID list, the OID directory, and the class index have fallen out of sync. There are scripts available to rebuild the allocated OID list, but there is currently no tool to rebuild the class index. When that file gets corrupted, the options are to build a new server… or to engage Michael, who has custom scripts that rebuild the entire CODB database. Those scripts have not been released to the public, as they are not yet ready for production use.
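The core of a class-index rebuild tool would be a full scan of the objects. A minimal sketch, deliberately leaving the on-disk enumeration abstract (how (OID, classname) pairs are read from the real object directories is not specified here):

```python
def rebuild_class_index(objects):
    """Rebuild the classname -> OIDs mapping from a full object scan.

    `objects` is any iterable of (oid, classname) pairs; enumerating those
    pairs from the on-disk object store is left abstract in this sketch.
    """
    index = {}
    for oid, classname in objects:
        index.setdefault(classname, set()).add(oid)
    return index


# Hypothetical scan results for illustration.
scanned = [(101, "User"), (102, "User"), (200, "Vsite")]
index = rebuild_class_index(scanned)
print(sorted(index["User"]))  # [101, 102]
```

The same scan could also regenerate the used OID list, which would keep both indexes consistent with the object directories in one pass.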

The Future

Strategy

To speed all of this up, the end goal is to extend the CODB code to manage the indexes for us. However, there is a lack of C programming skills within the dev team. With that in mind, our intent is to initially “wedge” some index management into the libraries that talk to CODB, so that they maintain our new indexes. In the longer term, it would be desirable for the CODB code itself to be updated in a manner consistent with the wedge code we introduce in the perl and php libraries.

Something or everything

The first question we need to answer is: what do we index? If we index everything, the index data could quickly more than double the current CODB storage requirements. A number of factors will impact this decision.

Everything

If we are doing a search that needs to match on multiple attributes, we can use indexes on all of those attributes and find the common OIDs across them… but, as mentioned above, the storage required to index every attribute would be enormous. Indexing a subset of attributes would of course be more efficient from a storage perspective. Let's look at that.

Something - Indexed and non indexed attributes

What if a search combines indexed and non-indexed attributes? Do we duplicate the CODB search code in our new wedge so that it searches only the subset of objects matched by the indexed attributes? That would be undesirable. Instead, if we do not index everything, then on each new search request we first check whether each attribute “is_indexed”. If they are all indexed, we use our indexes to get a fast response. If any attribute in the search criteria is NOT indexed, we fall back to a non-indexed search using the existing CODB code.
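The dispatch rule above is simple enough to sketch directly (the function names and call shapes are hypothetical stand-ins, not the real perl/php wedge API):

```python
def search(criteria, is_indexed, indexed_search, codb_search):
    """Take the fast indexed path only when EVERY attribute in the search
    criteria has an index; otherwise fall back to the linear CODB search."""
    if all(is_indexed(attr) for attr in criteria):
        return indexed_search(criteria)
    return codb_search(criteria)


# Tiny demo with stand-in search functions.
indexed_attrs = {"email", "login"}
result = search(
    {"email": "a@b.c"},
    is_indexed=indexed_attrs.__contains__,
    indexed_search=lambda c: "fast path",
    codb_search=lambda c: "fallback",
)
print(result)  # fast path
```

Note the all-or-nothing rule: one non-indexed attribute in the criteria sends the whole query down the fallback path, which is what keeps the wedge from having to duplicate CODB's matching logic.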

Something - All Indexed attributes

This is our best search case. If ALL of the attribute criteria have indexes available, we simply build a list of OIDs from each index and then compute, in memory, the set of OIDs present in every list.
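The in-memory step is a plain set intersection; a minimal sketch (the function name is hypothetical):

```python
def intersect_oid_lists(oid_lists):
    """Given per-attribute OID lists pulled from the indexes, return the
    OIDs present in every list -- i.e. the objects matching all criteria."""
    if not oid_lists:
        return set()
    result = set(oid_lists[0])
    for oids in oid_lists[1:]:
        result &= set(oids)
    return result


print(sorted(intersect_oid_lists([[1, 2, 3], [2, 3, 4], [0, 2, 3]])))  # [2, 3]
```

Starting from the shortest list first would shrink the working set fastest, a cheap optimisation if the indexes report their sizes.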

What to index

This is the next piece of research that needs to happen. We need to look at the searches regularly performed across CODB and find out which attribute patterns are common in the search criteria.

What to index - where to store our "is indexed" definitions

Let's say we define our list of attributes to index. We need to store that somewhere; my thinking is that we would store it in the System object. On startup, we would have a constructor that reads that field and builds a new “indexed” lookup that we can use later on.
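A sketch of that startup step, assuming (purely as an illustration) that the System object's field holds a comma-separated list of class.attribute names:

```python
# Hypothetical value read from the System object at startup.
SYSTEM_INDEXED_FIELD = "User.email,User.login,Vsite.fqdn"


def build_indexed_set(field_value):
    """Parse the System object's field into a set for fast 'is_indexed' checks."""
    return {name.strip() for name in field_value.split(",") if name.strip()}


INDEXED = build_indexed_set(SYSTEM_INDEXED_FIELD)


def is_indexed(classname, attribute):
    # O(1) membership test against the set built at startup
    return f"{classname}.{attribute}" in INDEXED


print(is_indexed("User", "email"))  # True
print(is_indexed("User", "shell"))  # False
```

Building the set once in the constructor means every subsequent search pays only a hash lookup per attribute to choose the fast or fallback path.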

The actual indexes

Assuming we use BDB indexing, I would assume that we do not store the actual attribute data in the indexes. Instead, we store only hashes. Our structure could therefore be class.attribute.hash as the search key, with an OID as the returned data, or class.namespace.attribute.hash if this is a namespaced indexed field.
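Key construction under that scheme might look like the following sketch (the choice of SHA-1 as the value hash is an assumption, not something the design has settled on):

```python
import hashlib


def index_key(classname, attribute, value, namespace=None):
    """Build the proposed BDB search key: class.attribute.hash, or
    class.namespace.attribute.hash for a namespaced field. Only a hash
    of the attribute value is stored, never the value itself."""
    digest = hashlib.sha1(value.encode("utf-8")).hexdigest()  # assumed hash
    if namespace:
        return f"{classname}.{namespace}.{attribute}.{digest}"
    return f"{classname}.{attribute}.{digest}"


key = index_key("User", "email", "alice@example.com")
print(key.startswith("User.email."))  # True
```

Equality lookups then become a single BDB get on the computed key; the trade-off is that hashed keys cannot support range or prefix matching on the value.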

What now

Before writing any pseudo code, I think we need to review the existing CODB code and documentation to see what was intended by the Cobalt designers.

info/devel/cceindex.1452609837.txt.gz · Last modified: 2016/01/13 01:43 by