Introduction

A simple implementation of a multi-attribute linear-hashed file structure database

Github Repo

Structure

A MALH relation R is represented by 3 physical files:

  • R.info Containing global information
    • a count of the number of attributes
    • the depth of main data file (d for linear hashing)
    • the page index of the split pointer (sp for linear hashing)
    • a count of the number of main data page
    • the total number of tuples (in both data and overflow page)
    • the choice vector (cv for multi-attribute hashing)
  • R.data Containing data pages
    • offset of start of free space
    • overflow page index (or NO_PAGE if none)
    • a count of the number of tuples in that page
    • the tuples (as comma-separated C strings)
  • R.ovflow Containing overflow pages, which have the same structure as data pages

When a MALH relation is first created, it is set to contain a \(2^n\) pages, with depth \(d=n\) and split pointers \(sp=0\). The overflow is initially empty.

alt text

After 294 tuples have been inserted, the file becomes (depending on field value distributions, tuple sizes, etc):

alt text

Example of a page split:

alt text

Pages in MALH files have the following structure: a header with three unsigned integers, strings for all of the tuple data, free space containing no tuple data.

alt text

Commands

Create

Creates MALH (multi-attribute linear-hashed) files by accepting 4 command line arguments:

  • the name of the relation
  • the number of attributes
  • the initial number of data pages (rounded up to nearest 2n)
  • the multi-attribute hashing choice vector

Example:

$ ./create R 4 6 "0,0:0,1:1,0:1,1:2,0:3,0"

makes a table called abc with 4 attributes and 8 inital data pages (6 will be rounded up to the nearest 2n)

The choice vector (4th argument):

  • bit 0 from attribute 0 produces bit 0 of the MA hash value
  • bit 1 from attribute 0 produces bit 1 of the MA hash value
  • bit 0 from attribute 1 produces bit 2 of the MA hash value
  • bit 1 from attribute 1 produces bit 3 of the MA hash value
  • bit 0 from attribute 2 produces bit 4 of the MA hash value
  • bit 0 from attribute 3 produces bit 5 of the MA hash value

Insert

Reads tuples, one per line, from std input and inserts them into the relation.

The bucket where the tuple is stored is determined by the appropriate number of bits of the combined hash value. If the relation has \(2^d\) data pages, then \(d\) bits are used. If the specified data page is full, then the tuple is inserted into an overflow page of that data page.

Example:

$ echo "100,xyz,abc" | ./insert R
# prints the hash value for the tuple

Query

Run selection and projection queries over a given relation. It supports wildcard and pattern matching, finds all tuples in either the data pages or overflow pages that match the query, as well as flexible attribute projection without distinct

Example:

$ ./query [-v] 'a1,a2,...' from RelName where 'v1,v2,...'
  • ‘a1,a2,…’ (or ‘*’): a sequence of 1-based attribute indexes used for projection, can be ‘*’ to indicate all attributes. The minimal ‘a’ value is ‘0’
  • ‘v1,v2,…‘: a sequence of attribute values used for selection

Each $v_i$ in the selection tuple can be:

  • Literal value: A specific value that must match exactly in the corresponding attribute position. (e.g., ‘xyz’ matches ‘xyz’, ‘64’ matches ‘64’)
  • Single question mark ‘?’: Matches any literal value in the corresponding attribute position. (e.g., ‘?’ matches ‘xyz’, ‘?’ matches ‘64’)
  • Pattern string containing ‘%’: A string that includes one or more’%’, where each ‘%’ matches zero or more characters.
$ ./query '*' from R where '?,?,?'
# matches any tuple in R

$ ./query '3, 1' from R where '10,?,?'
# projects attribute 1 and 3 in order from tuples where attribute 0 is 10

$ ./query '*' from R where '?,%xz%,?'
# matches any tuple where attribute 1 contains 'xz'

Status

Check the status of the files for table R with stats command:

$ ./stats R
# prints out the stats of R

dump

Show tuples, bucket-by-bucket

$ ./dump [RelName]

gendata

Use gendata to generate large numbers of tuples, and insert them as:

$ ./gendata 800 3 100 | ./insert R

Insert 800 tuples into the table, with ID values starting at 100.

clean

Remove Rel.data Rel.info Rel.ovflow. If no argument provided, remove every existing relation.

$ ./clean [RelName]