<?xml version="1.0" encoding="UTF-8"?><?xml-stylesheet type="text/xsl" media="screen" href="/App_Themes/default/rss.xslt"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:trackback="http://madskills.com/public/xml/rss/module/trackback/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:media="http://search.yahoo.com/mrss/" xmlns:evnet="http://www.mscommunities.com/rssmodule/"><channel><title>Comment Feed for Regarding the simdization/vectorization of table look-up operations (TechOff on Channel 9)</title><atom:link rel="self" type="application/rss+xml" href="http://channel9.msdn.com/forums/techoff/431977-regarding-the-simdizationvectorization-of-table-look-up-operations/rss/default.aspx" /><image><url>http://mschnlnine.vo.llnwd.net/d1/Dev/App_Themes/C9/images/feedimage.png</url><title>Comment Feed for Regarding the simdization/vectorization of table look-up operations (TechOff on Channel 9)</title><link>http://channel9.msdn.com/forums/TechOff/431977-Regarding-the-simdizationvectorization-of-table-look-up-operations/</link></image><description>Regarding the simdization/vectorization of table look-up operations</description><link>http://channel9.msdn.com/forums/TechOff/431977-Regarding-the-simdizationvectorization-of-table-look-up-operations/</link><language>en-us</language><pubDate>Sat, 11 Oct 2008 01:54:04 GMT</pubDate><lastBuildDate>Sat, 11 Oct 2008 01:54:04 GMT</lastBuildDate><generator>EvNet (EvNet, Version=1.0.3599.6114, Culture=neutral, PublicKeyToken=null)</generator><item><title>Re: Re: Re: Regarding the simdization/vectorization of table look-up operations</title><description>It's clearly not possible because of what I said in my previous post - while loading contiguous memory is possible due to the way Intel processors aggressively cache both page and bank requests, there is no way to stream memory from discontiguous parts of memory - so 
there is no way to stream two different 64-bit addresses to registers under any Intel processor, as a consequence of the way memory loads work in the processor.&lt;div&gt;&lt;br&gt;&lt;/div&gt;&lt;div&gt;Because the page and bank caches are nonprogrammable, there's no way to tell the processor to preemptively load pages to the processor, or to ignore cache requests (believe me, I've tried), so no, there is no way to stream 64-bit addresses to registers.&lt;/div&gt;&lt;div&gt;&lt;br&gt;&lt;/div&gt;&lt;div&gt;As I said in my previous post, if you're performing a functional mapping over contiguous memory, then a GPU is your best bet, and CUDA is the best programming language to do it in. If you're performing a generalized lookup over non-contiguous data, then a programmable FPGA is the fastest method.&lt;/div&gt;&lt;div&gt;&lt;br&gt;&lt;/div&gt;&lt;div&gt;So basically the answer is no, unless you're willing to go rather outside of what I would expect from an undergrad. The question then remains as to _why_ you would want to do such a thing - given that any speed issues are usually due to&lt;/div&gt;&lt;div&gt;&lt;br&gt;&lt;/div&gt;&lt;div&gt;1. Bad Algorithm&lt;/div&gt;&lt;div&gt;2. Lots of overhead (perhaps due to an overuse of library or OS functions)&lt;/div&gt;&lt;div&gt;3. Bank or cache misses (as detailed earlier)&lt;/div&gt;&lt;div&gt;4. Limitation ultimately at the memory bus speed&lt;/div&gt;&lt;div&gt;&lt;br&gt;&lt;/div&gt;&lt;div&gt;&lt;br&gt;&lt;/div&gt;&lt;div&gt;So now I've repeated what I said before but in a wordier way. The answer is probably no, but it's qualified because there are cases where it isn't no. 
Happy now?&lt;/div&gt;&lt;div&gt;&lt;br&gt;&lt;/div&gt;&lt;div&gt;&lt;blockquote&gt;&lt;div&gt;Dexter said:&lt;/div&gt;&lt;div&gt;&lt;/div&gt;&lt;div&gt; What you want to be "vectorized" are 4 memory reads at &lt;b&gt;different&lt;/b&gt; addresses&lt;br&gt;&lt;/div&gt;&lt;div&gt;&lt;/div&gt;&lt;/blockquote&gt;&lt;/div&gt;&lt;div&gt;Are there any examples where you would want to vectorize 4 memory reads at the same address?&lt;/div&gt;&lt;div&gt;mov xmm1, [xmm0]&lt;/div&gt;&lt;div&gt;mov xmm2, xmm1&lt;/div&gt;&lt;div&gt;mov xmm3, xmm1 &lt;/div&gt;&lt;div&gt;mov xmm4, xmm1&lt;/div&gt;&lt;div&gt;&lt;br&gt;&lt;/div&gt;&lt;div&gt;and the vectorized version (using the fact that we get a cache and bank bonus) for contiguous addresses&lt;/div&gt;&lt;div&gt;mov xmm1, [xmm0]&lt;/div&gt;&lt;div&gt;mov xmm2, [xmm0 + 64]&lt;/div&gt;&lt;div&gt;mov xmm3, [xmm0 + 128]&lt;/div&gt;&lt;div&gt;mov xmm4, [xmm0 + 192]&lt;/div&gt;&lt;div&gt;&lt;br&gt;&lt;/div&gt;&lt;div&gt;has one bank miss if xmm0 is aligned on a 128 byte boundary&lt;/div&gt;&lt;div&gt;&lt;br&gt;&lt;/div&gt;&lt;div&gt;and for non-contiguous addresses&lt;/div&gt;&lt;div&gt;mov xmm1, [xmm0]&lt;/div&gt;&lt;div&gt;mov xmm2, [xmm0 + 64]&lt;/div&gt;&lt;div&gt;mov xmm3, [xmm0 + 128]&lt;/div&gt;&lt;div&gt;mov xmm4, [xmm0 + 192]&lt;/div&gt;&lt;div&gt;mov xmm1, [xmm1]&lt;/div&gt;&lt;div&gt;mov xmm2, [xmm2]&lt;/div&gt;&lt;div&gt;mov xmm3, [xmm3]&lt;/div&gt;&lt;div&gt;mov xmm4, [xmm4]&lt;/div&gt;&lt;div&gt;&lt;br&gt;&lt;/div&gt;&lt;div&gt;has 7-9 bank misses (4 from the direct accesses, one from the leas and 2-4 from the page table directories during the page table misses) and four cache misses, and is a whole lot slower (loading in four addresses stored as size_t* on xmm0)&lt;/div&gt;</description><comments></comments><link>http://channel9.msdn.com/forums/TechOff/431977-Regarding-the-simdizationvectorization-of-table-look-up-operations/?CommentID=432293</link><pubDate>Fri, 10 Oct 2008 16:46:51 GMT</pubDate><guid 
isPermaLink="false">http://channel9.msdn.com/forums/TechOff/431977-Regarding-the-simdizationvectorization-of-table-look-up-operations/?CommentID=432293</guid><evnet:views>0</evnet:views><evnet:viewtrackingurl>http://channel9.msdn.com/432293/WebViewBug.aspx?EVT=0</evnet:viewtrackingurl><evnet:previewtext>It's clearly not possible because of what I said in my previous post - while loading contiguous memory is possible due to the way Intel processors aggressively cache both page and bank requests, there is no way to stream memory from discontiguous parts of memory - so there is no way to stream two&amp;#8230;</evnet:previewtext><dc:creator>evildictaitor</dc:creator><slash:comments>0</slash:comments><wfw:commentRss></wfw:commentRss><trackback:ping>http://channel9.msdn.com/432293/Trackback.aspx</trackback:ping></item><item><title>Re: Re: Regarding the simdization/vectorization of table look-up operations</title><description>&lt;P&gt;What YOU are talking about? :)&lt;BR&gt;&lt;BR&gt;Maybe his original post wasn't clear enough but in the second post he clearly mentioned different addresses. Basically he asks about an instruction like&lt;BR&gt;&lt;BR&gt;mov xmm0, [xmm1]&lt;BR&gt;&lt;BR&gt;which could load 2 values in xmm0 from 2 different addresses stored in xmm1. That means doing&amp;nbsp;two 64 bit loads from 2 different addresses. 
As far as I know current Intel chips can do&amp;nbsp;one 128 bit load from a single address.&lt;/P&gt;</description><comments></comments><link>http://channel9.msdn.com/forums/TechOff/431977-Regarding-the-simdizationvectorization-of-table-look-up-operations/?CommentID=432244</link><pubDate>Fri, 10 Oct 2008 11:42:07 GMT</pubDate><guid isPermaLink="false">http://channel9.msdn.com/forums/TechOff/431977-Regarding-the-simdizationvectorization-of-table-look-up-operations/?CommentID=432244</guid><evnet:views>0</evnet:views><evnet:viewtrackingurl>http://channel9.msdn.com/432244/WebViewBug.aspx?EVT=0</evnet:viewtrackingurl><evnet:previewtext>What YOU are talking about? :)Maybe his original post wasn't clear enough but in the second post he clearly mentioned different addresses. Basically he asks about an instruction likemov xmm0, [xmm1]which could load 2 values in xmm0 from 2 different addresses stored in xmm1. That means doing&amp;nbsp;two&amp;#8230;</evnet:previewtext><dc:creator>Dexter</dc:creator><slash:comments>0</slash:comments><wfw:commentRss></wfw:commentRss><trackback:ping>http://channel9.msdn.com/432244/Trackback.aspx</trackback:ping></item><item><title>Re: Regarding the simdization/vectorization of table look-up operations</title><description>What are you talking about? Intel chipsets come with an onboard (nonprogrammable) hardware page and bank cache, which means so long as you're accessing contiguous memory you ARE vectorising memory reads, whether you like it or not.&lt;div&gt;&lt;br&gt;&lt;/div&gt;&lt;div&gt;Moreover, where were you going to vectorize these memory reads to? If you're applying a functional map over an array then to get high levels of parallelism you should look at using a GPU to take the body of work (see CUDA), or if you're looking at a generalized lookup table, there are programmable FPGAs, but if your program is really running slowly, it's probably due to (in order):&lt;/div&gt;&lt;div&gt;&lt;br&gt;&lt;/div&gt;&lt;div&gt;1. 
Bad algorithm.&lt;/div&gt;&lt;div&gt;2. Lots of overhead - overuse of standard libraries?&lt;/div&gt;&lt;div&gt;3. Bank or cache misses&lt;/div&gt;&lt;div&gt;4. Memory bus line speed.&lt;/div&gt;</description><comments></comments><link>http://channel9.msdn.com/forums/TechOff/431977-Regarding-the-simdizationvectorization-of-table-look-up-operations/?CommentID=432142</link><pubDate>Thu, 09 Oct 2008 18:59:24 GMT</pubDate><guid isPermaLink="false">http://channel9.msdn.com/forums/TechOff/431977-Regarding-the-simdizationvectorization-of-table-look-up-operations/?CommentID=432142</guid><evnet:views>0</evnet:views><evnet:viewtrackingurl>http://channel9.msdn.com/432142/WebViewBug.aspx?EVT=0</evnet:viewtrackingurl><evnet:previewtext>What are you talking about? Intel chipsets come with an onboard (nonprogrammable) hardware page and bank cache, which means so long as you're accessing contiguous memory you ARE vectorising memory reads, whether you like it or not.Moreover, where were you going to vectorize these memory reads to? If&amp;#8230;</evnet:previewtext><dc:creator>evildictaitor</dc:creator><slash:comments>0</slash:comments><wfw:commentRss></wfw:commentRss><trackback:ping>http://channel9.msdn.com/432142/Trackback.aspx</trackback:ping></item><item><title>Re: Re: Regarding the simdization/vectorization of table look-up operations</title><description>Dexter, I am dealing with 64bit numbers, this will only involve 2 memory reads at different addresses until 2010 when Intel comes out with 256bit ymm registers.&lt;br&gt;&lt;br&gt;Anyway, I was thinking about this yesterday after I made my post. If such an instruction existed, I assume it would fall back to a serial retrieval if the information is not stored in cache, while it could do a parallel retrieval if the information is in cache. In my case, the lookup table is small, as it has 64 entries that only require 6 bits of storage each, although I am storing them so they use 64bits each. 
It should be possible to store them in 64 bytes, which should easily fit on a cache line.&lt;br&gt;&lt;br&gt;Also, another possibility would be to decrease the bus width by a factor of 4 and increase the clock speed by a factor of 4, which would allow the CPU to retrieve 32-bit blocks of data from up to 4 different memory locations (perfect for SIMD) in the same time it would have retrieved one 128-bit block of data, while it would still be able to retrieve one 128-bit block of data in the same time that it would have with the current 128-bit bus and clock rate. Of course, this is the Rambus approach to doing things, so it is unlikely it would ever be adopted in conventional computers.&lt;br&gt;</description><comments></comments><link>http://channel9.msdn.com/forums/TechOff/431977-Regarding-the-simdizationvectorization-of-table-look-up-operations/?CommentID=432059</link><pubDate>Thu, 09 Oct 2008 14:06:58 GMT</pubDate><guid isPermaLink="false">http://channel9.msdn.com/forums/TechOff/431977-Regarding-the-simdizationvectorization-of-table-look-up-operations/?CommentID=432059</guid><evnet:views>0</evnet:views><evnet:viewtrackingurl>http://channel9.msdn.com/432059/WebViewBug.aspx?EVT=0</evnet:viewtrackingurl><evnet:previewtext>Dexter, I am dealing with 64bit numbers, this will only involve 2 memory reads at different addresses until 2010 when Intel comes out with 256bit ymm registers.Anyway, I was thinking about this yesterday after I made my post. 
If such an instruction existed, I assume it would fall back to a serial&amp;#8230;</evnet:previewtext><dc:creator>Shining Arcanine</dc:creator><slash:comments>0</slash:comments><wfw:commentRss></wfw:commentRss><trackback:ping>http://channel9.msdn.com/432059/Trackback.aspx</trackback:ping></item><item><title>Re: Regarding the simdization/vectorization of table look-up operations</title><description>I'm not an expert in SIMD&amp;nbsp;but I'll try to guess:&lt;BR&gt;&lt;BR&gt;- What you call table look-up is basically indexed memory access&lt;BR&gt;- What you want to be "vectorized" are 4 memory reads at &lt;STRONG&gt;different&lt;/STRONG&gt; addresses&lt;BR&gt;&lt;BR&gt;To vectorize some arithmetic or logical operation is easy, you put N ALUs in the CPU and you have them executing the same operation on different&amp;nbsp;bit ranges&amp;nbsp;of the input registers.&lt;BR&gt;&lt;BR&gt;To vectorize memory reads what do you need? Well, I assume you need N memory access units, N buses and a N-ported memory. Tough job having all those and if you don't have them you'll end up serializing all memory reads. 
Maybe it could work if all reads would be on the same cacheline but that doesn't always happen.&lt;BR&gt;&lt;BR&gt;</description><comments></comments><link>http://channel9.msdn.com/forums/TechOff/431977-Regarding-the-simdizationvectorization-of-table-look-up-operations/?CommentID=432039</link><pubDate>Thu, 09 Oct 2008 11:19:30 GMT</pubDate><guid isPermaLink="false">http://channel9.msdn.com/forums/TechOff/431977-Regarding-the-simdizationvectorization-of-table-look-up-operations/?CommentID=432039</guid><evnet:views>0</evnet:views><evnet:viewtrackingurl>http://channel9.msdn.com/432039/WebViewBug.aspx?EVT=0</evnet:viewtrackingurl><evnet:previewtext>I'm not an expert in SIMD&amp;nbsp;but I'll try to guess:- What you call table look-up is basically indexed memory access- What you want to be "vectorized" are 4 memory reads at different addressesTo vectorize some arithmetic or logical operation is easy, you put N ALUs in the CPU and you have them&amp;#8230;</evnet:previewtext><dc:creator>Dexter</dc:creator><slash:comments>0</slash:comments><wfw:commentRss></wfw:commentRss><trackback:ping>http://channel9.msdn.com/432039/Trackback.aspx</trackback:ping></item></channel></rss>