CCE Indexing - The need for speed

Background

One of the challenges of CCE is the lack of indexes for searching. For example. If you have 10,000 email addresses and you want to add another one, there is a requirement to check that the proposed new email address does not exist. Linear behaviour when searching for X records is not exactly conducive to a responsive system.

So. We need to add indexing to help improve performance. The Cobalt authors intended to add BDB indexes for other key's in CODB, but this was never implemented. This document is an attempt to describe the possible future architecture of how we can add indexing to CODB.

Current Indexing

Class Indexing

Before talking about what we might do to improve the performance of searches that require matching on object attributes, it is worth reviewing the current indexing that already exists for simple class searches. If we execute a find <classname> in cce, an existing berkley index is used. This index is a BDB file located at /usr/sausalito/codb/db.classes

The contents of this index are pretty simple. We search by classname X in the database, and the index returns a list of OID's that match this classname. When we create a new OID, we add an entry to the index for classname X that adds OID Y to the existing list of OID's. Conversely, when we delete an OID, we load the OID list for the class in question, and remove the OID of the object to be deleted from the index.

Free OID Indexing

In addition to the above index, there is a simple text list of OID's that describes the currently in use OID numbers. This file lives in /usr/sausalito/bin/codb.oids

This index is a plan text file that contains a list of OID's that are in use. This is a comma separated list, that can also contain ranges of utilised OID's.

When an object is created or deleted, this file is updated to add or remove the appropriate OID number from the text file.

One observation. Theoretically, we could get the list of used OID's from the above class index instead of the text file. Why did the Cobalt developers decide to use a text file instead of the existing OID list? We already have the list of known “classes” that are defined, based on the in memory information that comes from the schema files during boot up. Why not just do a find on all of those classes to get an array of allocated OID's? My assumption there is that the use of the current structure would require a single read to a single file to find the first OID number that exists after the first comma in the current text file. This has the benefit of being faster for object creation, but necessitates the requirement to have an extra index that might potentially be out of sync either with the real object directories, or the class index.

Locking

Current updates to the above class index and used OID list will of course require protection through locking mechanisms. I have not yet reviewed the code to understand how the current locking mechanisms work.

Data Integrity

As mentioned above, there have been numerous occasions where the current indexes have been out of sync between the used OID list, the OID directory, and the class index. There are scripts available to rebuild the allocated OID list, but there is currently no tool available to rebuild the class index. When this file gets corrupted, there is a requirement to build a new server… or engage Michael who has some custom scripts that rebuild the entire CODB database. But these scripts are not released to the public as they are not currently ready for production use.

The Future

Strategy

To speed all of this up, the end goal would be to extend the CODB code to manage the indexes for us. However, there is a lack of C programming skills within the dev team. With that in mind, it is our intent to initially “wedge” some index management into the libraries that talk to CODB so that will maintain our new indexes. At some point in the longer term, it would be desirable for the CODB code to be updated in a manner that is consistent with the wedge code that we will introduce in the perl and php libraries.

Something or everything

The first question we need to answer, is what do we index? If we index everything, this can quickly turn into a lot of index data that would more than double the current CODB storage requirements. There are a number of factors that will impact this decision.

Everything

If we are doing a search that needs to match on multiple attributes, we can use indexes on all attributes, and find the common OID's across those indexes… but as mentioned above, the data storage requirements to index on every attribute will be enormous. If we index on a subset of attributes, that will of course be more efficient from a storage perspective. Lets look at that…

Something - Indexed and non indexed attributes

What if we have a search that has attributes to match that have a combination of indexed and non indexed fields? Do we duplicate the CODB search code in our new wedge that will only search the subset of objects that matched from the indexed attributes? That would be undesirable… If we do not index everything… when we get a new search request - we will first check if each attribute “is_indexed”. If they are all indexed, we will use our indexes to get a fast response. If any of the attributes in our search criteria are NOT indexed, we will fall-back to a non indexed search using the existing CODB code.

Something - All Indexed attributes

This is our best search case. If we are searching and ALL of our attribute criteria have indexes available - we will just build a list of OID's from each of the indexes - and then iterate in memory for the list of OID's that are in all indexes.

What to index

This is the next piece of research that needs to happen. We need to look at the existing searches that are regularly performed across CODB, and find out which are our common attribute patterns for our search criteria.

What to index - where to store our "is indexed" definitions

Lets say we define our list of attributes to index. We need to store that somewhere. My thinking is that we would store that in the System object. On startup, we would have a constructor that would read that field - and build a new “indexed” index that we can use later on.

The actual indexes

Assuming we use BDB indexing, I would assume that we are not storing the actual data in the indexes. Instead, we will just store hashes in the indexes. Our structure could therefore be class.attribute.hash as our search key, with an OID as the resultant return data…. or class.namespace.attribute.hash if this is a namespace indexed field…

What now

OK. So lets talk about the future…

Assumption 1. We don't want to index everything. If we did that, the overhead would be massive, and the overhead would be undesirable.

Assumption 2. If we don't want to index everything, we want to have CODB “learn” what to index based on past queries, rather than have a static definition of what needs to be indexed.

So - lets talk about the code

FIND

We would have ONE index file. That file would initially have no entries. The first time we do a CODB find on an attribute, the call would go to the legacy CODB process. If the number of objects returned is greater than one, we would create index entries for each of the attributes that were part of the search criteria. The index KEY would be class.<optional-namespace>.key.hash, with the OID's matched. In addition to this, we would add an index entry to the same file isindexed.class.<optional-namespace>.key with a return value of 1.

Future find requests would check if a key is indexed first. And if it is, the index will be used to return a result. If there are multiple attributes being searched, the result will be an set intersection of searches across each of the attributes being searched

SET

Set transactions will similarly search the isindexed for each of the attributes being modified. For each attribute being changed, it will first 1. Delete the index entry for the previous value, and 2. Add a new index entry for the new value.

CREATE

For each CREATE request, the system will look at isindexed for each attribute being stored. If an attribute is marked as indexed, a new entry will be added to the index for the respective attribute values.

DESTROY

For each DESTROY request, the system will look at all current attributes. For all isindexed attributes, the system will destroy index entries for corresponding attributes. Or, better still - if the index allows searching by OID's, the system will destroy all index entries for the matching OID being destroyed.