Posted By: Bass | Jun 20th @ 7:46 AM
page 1 of 1
Comments: 17 | Views: 437
Bass
Bass
www.s​preadfirefox.c​om/5years/

Create - given some data, store it and return an ID to access it later

Update - given some data and an ID, replace already existing data at that ID

Search - given an ID or some data, search for similar IDs using some secret sause mathematical relationship

Delete - given an ID, clear or delete the contents of the data in that ID

 

The primarly goal is to make sure the search operation is fast. The search operation probably can not be implemented in a non-linear fashion, but I haven't investigated the mathematics in great detail. I'm thinking the best way to implement this should search be linear is using an array. Anyone else have any suggestions?

Without knowing the context it's a little hard to give advice, but if you want to search items using a key, I'd use a Dictionary object to store my items.

Dr Herbie
Dr Herbie
Horses for courses

Why would you need to iterate for CRUD operations? I thought a dictionary the keyed lookup was faster than iterating through an array (there may be a case where small arrays are faster, but you'd have to perf analyse that).

The .NET dictionary allows access to the Keys and values as separate lists should you need them. 

Of course the real solution is to create a generic interface to your system and perf test with different data structures Smiley  Borland's BIDS library was a great example where the underlying data structure (array, linked list, doubly linked list) was changeable for each algorithmic data type (queue, stack, etc), allowing you to test the performace of different implementations.

Herbie

 

EDIT:  Additionally, you might want to look up how relational databases handle these types of things:  probably a balanced tree implementation would be a good place to start for fast searches.

Dr Herbie
Dr Herbie
Horses for courses

Right, OK, so not really your basic CRUD then. Wink

The way I think of  balanced tree is that it's optimised to compartmentalise data through one dimension, but you need to comparmentionalise data through N-dimensions.  So you need a way to group the data into blocks where each block represents a partition of the n-dimensional space, but in such a way that you can quickly tell if a specific point is greater than or less than each of the n-dimension partitions of that block so you can 'zoom in' to the correct place easily.

Yeah. Good luck with that  Big Smile

So are you looking for a specific piece of data with a specific co-ordinate (like you know specifically that there is a piece of data for [1, 2, 3]?  If so, could you create a hash for the specific co-ordinate and use a hashtable for storage and retrieval.

Or are you more like looking for all the data within a bounded range\area?  That's not going to be easy.

Let us know how you get on.

Herbie

Matthew van Eerde
Matthew van Eerde
AKA Maurits

Second the hash suggestion.

W3bbo
W3bbo
The Master of Baiters

What about creating an index for each dimension, then you can do linear-or-better search on the indexes for each dimension to get the desired result; this also allows for finding points in space within a specified radii of the search point too. I assume you could get your RDBMS to maintain the indexes for you.

Unless I'm missing something?

W3bbo
W3bbo
The Master of Baiters

There's a set of hash functions that return similar values for similar inputs. One example is a blur filter; I forget the set's name though.

For thousands of dimensions.... just what kind of data are you processing?

Matthew van Eerde
Matthew van Eerde
AKA Maurits

I'm a little lost now.  You're looking to apply a distance metric over many dimensions?  RMS of the distances across each dimension is the usual answer, but I am getting very curious as to what kind of domain you're working on.

DCMonkey
DCMonkey
Monkey see, monkey do, monkey will destroy you!

Some more general info might still be helpful. Will the set of data involved in the search calculation fit in memory? Is there any pattern to the data that can be taken advantage of? Is some kind of preprocessing step allowed with a lengthly runtime or do the searches need to be fast on fresh raw data? How much hardware can you throw at the problem? Is there an opportunity for parallelization? Is this why you are interested in Hadoop (from that other thread)?

 

Also, by "fast" do you mean seconds, minutes, hours, or days?

page 1 of 1
Comments: 17 | Views: 437
Microsoft Communities