I had to implement something similar not too long ago, and I also had to do percentage matching. In my case I couldn't change the underlying database, but I was lucky enough to be able to narrow results down to around 100 or so before having to calculate it. I ended up settling on a combination of Levenshtein and Double-Metaphone.
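For what it's worth, the Levenshtein half of that combination is easy to sketch. This is only an illustration, not the exact code I used, and the Double-Metaphone side would come from a library rather than being hand-rolled, so it is left out here:

```python
# Minimal sketch of the Levenshtein (edit distance) half of the matching.
# Double-Metaphone would come from a separate library and is not shown.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner loop over the shorter string
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """Turn the edit distance into a 0..1 percentage-style match score."""
    longest = max(len(a), len(b)) or 1
    return 1.0 - levenshtein(a, b) / longest
```

With ~100 candidate rows, running something like this per candidate and keeping the best scores is cheap.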
If you can add columns to the underlying table, then you might want to look at calculating the Double-Metaphone up front and storing the hashes it creates in a couple of columns. Then you can compare the Double-Metaphone hashes directly in SQL instead of calculating them on the fly.
Here's an article where a guy did it and stored the hash as a UDT.
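The idea can be sketched like this, using SQLite (Python's built-in sqlite3) in place of SQL Server, and a made-up placeholder in place of a real Double-Metaphone hash; the table, column, and function names are all assumptions for illustration:

```python
import sqlite3

def phonetic_key(name: str) -> str:
    # Placeholder for a real Double-Metaphone hash -- an assumption only.
    # Here: uppercase, keep the first letter, drop later vowels and Y.
    name = name.upper()
    return name[:1] + "".join(c for c in name[1:] if c not in "AEIOUY")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, phonetic TEXT)")
conn.execute("CREATE INDEX ix_phonetic ON people(phonetic)")

# Store the precomputed key alongside the name at insert time...
for n in ("Smith", "Smyth", "Jones"):
    conn.execute("INSERT INTO people VALUES (?, ?)", (n, phonetic_key(n)))

# ...so the fuzzy lookup becomes a plain indexed equality comparison,
# with no per-row hash calculation inside the query.
rows = conn.execute("SELECT name FROM people WHERE phonetic = ?",
                    (phonetic_key("Smith"),)).fetchall()
```

The payoff is that the expensive hashing happens once per row at write time, and the query itself stays a cheap indexed seek.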
Well, I think you'd see a speed increase since you are reducing the instructions in an inner loop there, but not on the order of magnitude that you're looking for.
My guess is that you'd see a slight increase in performance.
A better approach is to run a profiler against your code and see where it is actually running slow and then optimize from there.
Any reason why you are using that collation? If you are stuck with that DB collation, why not add a COLLATE clause to the query?
SELECT TOP(10) keyfield FROM #Test WHERE keyfield >= @Search2 COLLATE Latin1_General_CI_AI;
The reason is found in Books Online (BoL):
In SQL Server 2005, you should primarily use Windows collations. This is particularly true if you have a mix of Unicode and non-Unicode columns in your database. Windows collations actually apply Unicode-based sorting rules to both Unicode and non-Unicode data. This means that SQL Server internally converts non-Unicode data to Unicode to perform comparison operations. This provides consistency across data types in SQL Server and also provides developers with the ability to sort strings in their applications that use the same rules that SQL Server uses.
SQL collations, on the other hand, apply non-Unicode sorting rules to non-Unicode data, and Unicode sorting rules to Unicode data, by using a corresponding Windows collation for the Unicode data. This difference can cause inconsistent results for comparisons of the same characters. Therefore, if you have a mix of Unicode and non-Unicode columns in your database, they should all be defined by using Windows collations so that the same sorting rules are used across Unicode and non-Unicode data.
You should use SQL collations only to maintain compatibility with existing instances of earlier versions of SQL Server or to maintain compatibility in applications that were developed by using SQL collations in earlier versions of SQL Server.
Also, Latin1_General_Bin has specific rules due to legacy behavior which is why Latin1_General_Bin2 was introduced.
I'd recommend you buy a good SQL Server book and read through it. It should explain a lot of the DB performance issues to you. One thing to remember is that SQL works best when doing things in Sets vs. One-at-a-time.
First and foremost, look at where your performance is bottlenecking (SQL Profiler is where you should start out). For a 20MB database, you shouldn't be having the problems you describe.
Some things I'd suggest you look at first:
If you are doing a lot of inserts/deletes, remember that those LOCK the table, so SELECTs will by default wait for the lock (unless you use the NOLOCK hint or SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED).
Are you reindexing when you insert/delete a bunch of rows (use the percentage of the table affected as an indicator here)? If you are regularly deleting all the rows from the table, think about DROPping the table vs. DELETEs (this can help with fragmentation).
Also, if you are inserting a bunch of rows into an empty table, think about dropping the non-clustered indexes before the inserts and then re-creating the non-clustered indexes afterwards.*
If you are doing a lot of insert/deletes in a non-clustered indexed table, that will cause all sorts of page fragmentation and performance will suffer over time.
Clustered Indexes vs. Heaps:
As far as clustered indexes go, it comes down to how SQL stores things internally.
When you create a table without any index, it is by default a "heap". When you insert rows, they are stored internally in no particular order (typically insertion order, though if a deletion occurred, a new row can "fill in the hole").
When you create a clustered index for a table (you can only have one per table), the data is stored in the order of the clustered index key. For performance reasons, you typically want that key to be a unique, monotonically increasing value (for example, an IDENTITY column). **
So, getting back to things a little bit: depending on how WIDE your table is and how many columns you want in your result set, having a Clustered Index Seek in your query can be a good thing. When you seek, it reads the entire record from disk. (If you only need one or two columns and your table is 200 columns wide, there are better options. ***)
A non-clustered index is like a lookup table. It stores the index key columns and a pointer to the clustered index (or heap) where the full record resides. Non-clustered indexes are where you will typically do query optimizations, but each index you add to a table slows down insert/update/delete performance (since SQL has to do more work), so it can come down to a balancing act. Also, it is important to remember that an index key can span multiple columns. Order matters in multi-column indexes, and for performance you want the most selective column first in the list (for example, city vs. state in address records).
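The column-order point is easy to see in a query plan. The answer above is about SQL Server, but SQLite (via Python's sqlite3) shows the same behavior through EXPLAIN QUERY PLAN, where SEARCH means a seek and SCAN means a full scan; the table and index names here are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE address (street TEXT, city TEXT, state TEXT)")
# Most-selective column (city) first, as suggested above.
conn.execute("CREATE INDEX ix_city_state ON address(city, state)")

def plan(sql):
    # EXPLAIN QUERY PLAN rows carry the plan description in the last column.
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Filter on the leading index column: the planner can seek on the index.
seek = plan("SELECT * FROM address WHERE city = 'Springfield'")
# Filter on the trailing column only: the index cannot be used for a seek.
scan = plan("SELECT * FROM address WHERE state = 'IL'")
```

Swap the index column order and the two plans swap as well, which is why you pick the order based on how you actually query.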
So, let's get back to Query Analyzer for a second:
Ideally, you want your queries to be SEEKs, and the order of performance generally goes like this (best performing to worst):
Clustered Index Seek/Index Seek (Depending on table width and rows returned)
Clustered Index Scan
Now, generally it is REALLY BAD to try and force the query to use any particular index. This goes doubly so if you don't know what you are doing. The SQL engine usually does a really good job at optimizing queries. I highly recommend not trying to use INDEX HINTS in your queries.
So, it comes down to "How do you get a query to do index seeks?"
Well, there are lots of rules around this, but a general guideline is that you want your WHERE clause to use "=" rather than "<>".
If you think about indexes as lookup tables, it is easier to look for existence than non-existence ("<>" has to look through the entire index to see whether the value is there, hence an index scan).
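You can watch this happen in a plan. Again this is SQLite via Python's sqlite3 rather than SQL Server, but the seek-vs-scan distinction it reports (SEARCH vs SCAN) illustrates the same rule; names are invented for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (lastname TEXT)")
conn.execute("CREATE INDEX ix_lastname ON person(lastname)")

def plan(sql):
    # Join the plan-description column of each EXPLAIN QUERY PLAN row.
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Equality: the planner can seek directly into the index.
eq = plan("SELECT lastname FROM person WHERE lastname = 'Jones'")
# Inequality: the planner has to look at everything.
ne = plan("SELECT lastname FROM person WHERE lastname <> 'Jones'")
```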
Also, avoid the "OR" statement. This is almost a guaranteed index scan. If you have a WHERE clause that looks like:
WHERE LastName = 'Jones' OR LastName = 'Smith'
you might want to look at rewriting it to:
WHERE LastName IN ('Jones', 'Smith')
The LIKE statement is "Tricky".
Using LIKE 'Smith%' should result in index seeks.
Using LIKE '%Smith%' will result in an index scan.
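The same plan-inspection trick shows the LIKE difference. This sketch uses SQLite via Python's sqlite3 instead of SQL Server (the NOCASE collation is needed so SQLite's default case-insensitive LIKE can use the index at all, which is a SQLite-specific wrinkle); names are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# NOCASE collation lets SQLite's (case-insensitive) LIKE use the index.
conn.execute("CREATE TABLE person (lastname TEXT COLLATE NOCASE)")
conn.execute("CREATE INDEX ix_lastname ON person(lastname)")

def plan(sql):
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Prefix pattern: rewritable as a range, so the index can be seeked.
prefix = plan("SELECT lastname FROM person WHERE lastname LIKE 'Smith%'")
# Leading wildcard: no usable starting point, so every entry is scanned.
infix = plan("SELECT lastname FROM person WHERE lastname LIKE '%Smith%'")
```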
Watch out for CONVERT/CAST
SQL can be smart about CONVERT/CAST, but they can still cause an index scan, because the CONVERTed/CAST values aren't what is stored in the index. If you use CONVERT/CAST, apply it to the non-table data.
Using CONVERT(VARCHAR(30), table.dtDue) = '1/31/2008' will result in Index Scans
Using table.dtDue= CONVERT(DATETIME, '1/31/2008') will result in Index Seeks
Note: SQL can be smart about conversions.
Using table.dtDue = '1/31/2008' will also result in Index Seeks
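The same principle holds outside SQL Server. SQLite has no CONVERT, but wrapping the column in any function defeats the index in exactly the same way; this sketch (Python's sqlite3, invented names, SQLite's date() standing in for CONVERT) shows it:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE invoice (due TEXT)")  # ISO-8601 date strings
conn.execute("CREATE INDEX ix_due ON invoice(due)")

def plan(sql):
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Function applied to the COLUMN: the indexed values are no longer
# what is being compared, so the whole table is scanned.
wrapped = plan("SELECT due FROM invoice WHERE date(due) = '2008-01-31'")
# Conversion applied to the LITERAL instead: the column is compared
# exactly as stored, so the index can be seeked.
bare = plan("SELECT due FROM invoice WHERE due = date('2008-01-31')")
```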
*Note: This is a quicker operation because it is working across a set of data instead of individually when the inserts occur. Another option is to do your data loading into a TEMP table and then do an INSERT from the temp table into the actual table. (Think Sets vs. Individual)
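The temp-table pattern in that note looks like this in miniature (SQLite via Python's sqlite3; table names are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")

# Stage the incoming rows in a temp table first...
conn.execute("CREATE TEMP TABLE staging (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO staging VALUES (?, ?)",
                 [(1, 9.99), (2, 4.50), (3, 12.00)])

# ...then move them with one set-based statement instead of row by row,
# so the engine works on the whole set at once. (Sets vs. Individual.)
conn.execute("INSERT INTO orders SELECT id, amount FROM staging")
count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```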
**Note: a GUID is typically a bad choice because it is random rather than monotonically increasing, which causes page splits; that is why the NewSequentialID() function was introduced in SQL Server 2005.
***Note: This is where a "Covered Index" comes into play. If you want to learn more, I suggest searching for that term.
I really don't think it is an OS issue (I could almost guarantee it isn't). My guess is that it is a difference in the compilers. I'm not sure how much, if any, code was changed to port GMP to Win32, but there could definitely be something going on there.
Any chance you could post the code somewhere for people to take a look at and see if they are seeing the same thing you are?
As far as anyone from the Visual Studio team responding... I'm not sure if anyone from those teams visits Channel 9. So I wouldn't count on anyone from there replying to this thread. (It is possible, just not likely IMO)
I see that you are linking to GMP in Linux, what math library are you using in Windows? Are you using a port of GMP for Windows? If so, which port? Where did you get it from and how was it compiled?
Since computing Mersenne Primes with the Lucas Lehmer test is pretty FFT/iFFT intensive the quality of whatever math library you are using is key.
Seeing as Prime95 for Windows and GLucas for Linux/Mac both seem to run at about the same speed (granted, they both use highly optimized assembly for the FFTs), I highly doubt that Windows is making things slower. (See http://www.mersenne.org/)
You claim to be comparing apples to apples, but I highly doubt it.
Hmm... the only problem I can see offhand (and I tested it to be sure, just in case) was that you didn't use extern "C" in your C++ function definitions, which left the exported names decorated. After adding that, everything worked fine using:
[DllImport(@"c:\test.exe", EntryPoint = "myText")]
If I left the extern "C" off and used the decorated name (obtained via dumpbin), that also worked:
[DllImport(@"C:\test.exe", EntryPoint = "?myText@@YAXPAD@Z")]
So it looks like your problem was with C++ name decoration more than anything.
You can read more about it here:
Took all of 30 seconds. You almost have the answer in the question itself.
Just to play devil's advocate for a minute here...
What is to say that someone couldn't hang one of these up in the same places with some det-cord in it? From what I understand, these were devices with LEDs and batteries; an evildoer could easily put some det-cord around the edges and have a pretty potent bomb. Since these were hung up on bridges and whatnot, it could be seen as a threat.
Yes, they turned out to be Lite-Brites this time, but when funky-looking signs start showing up on bridges in my city, I would expect the city to investigate. And let's not forget that Boston was an important city here, in that two of the flights on 9/11 left from Boston's Logan airport...
Not to say that the government didn't overreact, but it was a pretty stupid stunt.