Joe Duffy, Huseyin Yildiz, Daan Leijen, Stephen Toub - Parallel Extensions: Inside the Task Parallel Library

Download

Right click “Save as…”

  • Mid Quality WMV (Lo-band, Mobile)
  • MP3 (Audio only)
  • MP4 (iPhone, Android)
  • High Quality MP4 (iPad, PC, Xbox)
  • Mid Quality MP4 (Windows Phone, HTML5, iPhone)
  • WMV (WMV Video)
Joe Duffy, Huseyin Yildiz, Daan Leijen, Stephen Toub and I gathered in a conference room in building 122 to dig into the Task Parallel Library infrastructure. You heard about the Parallel Computing Platform a few months ago in an interview with Anders Hejlsberg and Joe Duffy. We didn't go too deep in that talk; it was an introduction to the Parallel Computing Platform.

Here, we take a dive down into the technical rabbit hole with Daan, Joe, Stephen and Huseyin.

Daan is an MSR researcher whose work has been instrumental in bringing TPL and Parallel Extensions to life. Of course, Joe is the guy who invented PLINQ (he wrote the original Think Week paper that impressed Bill) and is a lead developer on the Parallel Computing Platform team. Stephen is the Program Manager (and is the one driving and scheduling many of the interviews you will see covering the Parallel Computing Platform here on C9 - thanks, Stephen!), and Huseyin is a developer who recently joined the group and is already making a big impact.

Most of the time here is spent on the whiteboard with Daan. Make some time for this conversation. There's an awful lot to learn here.

Enjoy!

Click here for the low res download.

Follow the Discussion

  • evildictaitor Devil's advocate
    Oooh. The last few minutes are ever so interesting. I'm not so sure that compilers are as far off as you think when you say that automatic parallelism is far away. For languages like C and C++ it's going to be a very long time before any useful autoparallelization happens, but for languages like Haskell, F# and C# quite a lot can be done (and is being done) to automatically parallelize code, and there are some interesting theorems on the upper bound of automatic parallelization that compilers can do, and it's not far off the theoretical optimum (and equal for safe, managed code).
  • Charles,

    You brought up a great point/question that has been on my mind for a while.

    Why not add some keyword or attribute to the language itself to further enhance parallelism?

    I can't remember his name but he replied that they are looking into this level of language integration.  I'd love to see a video on that topic.

  • evildictaitor Devil's advocate
    MetaGunny wrote:
    Charles,

    You brought up a great point/question that has been on my mind for a while.

    Why not add some keyword or attribute to the language itself to further enhance parallelism?

    I can't remember his name but he replied that they are looking into this level of language integration.  I'd love to see a video on that topic.



    Noooo! Leave C# alone! One of the reasons C# is nice and easy to learn is that its core keyword set is quite compact. If you start adding lots of keywords here, there and everywhere, the language gets out of hand quickly.

    On the other hand, there's no reason why a compiler shouldn't be able to take the same keyword "for" to mean both the normal synchronous meaning when the for-body is expensive, and the Parallel.For() when the body can be easily split into asynchronous tasks, and have this as an option inside the build menu.
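    To make the "same keyword, two meanings" idea concrete, here is a minimal sketch (the AdaptiveLoop helper and its UseParallel switch are invented for illustration; this is not a TPL or compiler feature) of a loop wrapper that a build setting could flip between the sequential form and Parallel.For, assuming the loop body has no cross-iteration dependencies:

    using System;
    using System.Threading.Tasks;

    static class AdaptiveLoop
    {
        // Hypothetical switch a build configuration or profiling pass could flip.
        public static bool UseParallel = true;

        // Runs the same body either sequentially or via Parallel.For, keeping the
        // call site identical either way. Only safe when iterations are independent.
        public static void For(int fromInclusive, int toExclusive, Action<int> body)
        {
            if (UseParallel)
            {
                Parallel.For(fromInclusive, toExclusive, body);
            }
            else
            {
                for (int i = fromInclusive; i < toExclusive; i++)
                    body(i);
            }
        }
    }

    A call site such as AdaptiveLoop.For(0, data.Length, i => results[i] = Crunch(data[i])); then stays the same whichever mode is selected.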
  • John Melville, MD Equality Through Technology

    Has anyone considered a "debug" switch to run with n queues, regardless of how many processors are available?  I see a bad scenario coming.

    1. Developer is on a 1-2 core box, or a many core box with enough other junk (VS, virus scanner, outlook, IE, whatever else) so that unbeknownst to the programmer, there is no real parallelism.

    2.  Dev signs off on the code that has never been tested running truly parallel.

    3. Code gets loaded on the mega-behemoth 16 core production machine that's not running anything else.

    4. Code runs in parallel for the first time in production.  Or even worse, it's shrink-wrap software; and next year when bigger machines come out, more parallelism is exposed and the software starts failing randomly.

    Most developers would consider #4 to be a very bad thing, but I find it inevitable.  My dev box is usually running 3-6+ apps when I'm developing, and therefore most of my testing.  If I can't force the code to be parallel in testing, the interleaved paths might get very little coverage, and it will be very hard to know this is happening.

    Just curious if this has been thought about.
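    The CTP build discussed in the video exposed its scheduler knobs differently, but as a hedged illustration of the kind of "debug switch" being asked for, the released TPL (.NET 4 and later) lets a test configuration pin the degree of parallelism regardless of core count via ParallelOptions:

    using System;
    using System.Threading.Tasks;

    static class ForcedParallelism
    {
        static void Main()
        {
            // ParallelOptions is from the released TPL (.NET 4+); the December
            // CTP exposed scheduler policy through different types.
            var options = new ParallelOptions { MaxDegreeOfParallelism = 8 };

            Parallel.For(0, 1000, options, i =>
            {
                // Even on a 1-2 core dev box the pool threads time-slice, so the
                // parallel code path and at least some of its interleavings get
                // exercised before the code reaches a 16-core production machine.
                Console.WriteLine("iteration {0} on thread {1}",
                    i, System.Threading.Thread.CurrentThread.ManagedThreadId);
            });
        }
    }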

  • I'm using the Silverlight player and there is no sound for this video. What's up, Charles?
  • Curt Nichols No Silver Bullet
    John Melville, MD wrote:
    

    1. Developer is on a 1-2 core box, or a many core box with enough other junk (VS, virus scanner, outlook, IE, whatever else) so that unbeknownst to the programmer, there is no real parallelism.



    Nothing is going to be able to stop the developer from writing bad code or failing to test and profile code.

    That said, developers need tools that help profile and illustrate an application's behavior, particularly for data parallelism. We do have some access to these things today--we can pretty easily look at numbers for context switches, cache line loads, etc.--but it has a pretty ad hoc feel to it. It would be great to have tools that target profiling parallel code and, you know, dumb it down for the rest of us. Smiley Intel's Thread Profiler does a great job of visualizing core usage, but to my recollection it only works on native code.

    John Melville, MD wrote:
    

    3. Code gets loaded on the mega-behemoth 16 core production machine that's not running anything else.

    4. Code runs in parallel for the first time in production.  Or even worse, it's shrink-wrap software; and next year when bigger machines come out, more parallelism is exposed and the software starts failing randomly.



    Untested synchronization code will always be a risk, whether it's used in a simple multi-threaded program or one that attempts parallelism. (Yes, I said "attempts." Tongue Out)

    Scalability is going to be another lurking demon. Does your program scale? It's not just a question of machine resources, or of whether your algorithms and locking code scale.

    Hardware architecture also affects whether your application scales. E.g., a 2-socket, 4-core computer will behave differently than a 1-socket, 4-core computer. Your application may work better with one or the other (particularly depending on its cache usage patterns).

    More than that, you must test on the target hardware. I could tell you how software I've written scales on 4- and 8-core computers of differing architectures, but I can't and won't promise anything about its performance on a 16-core box, because I've never had access to one. For all I know the algorithms or locking could bring it to its knees. For that matter, I/O could become the bottleneck. I won't know unless I can profile it on that particular hardware.

    This last bit is, I think, the most likely outcome of your "naive programmer" scenario above. The developer codes it up on a 4-core box and it runs pretty well and scales nicely from 1 to 4 cores. Move it to a 16-core box and it runs half as fast. Seriously. Stay in this business a few years and you will see this happen. A lot.

    Perhaps this is just another argument for configurable core usage.

    Looking forward to watching the video, looks like it should be good. Smiley
  • Christian Liensberger (littleguru) <3 Seattle
    Brainstorming here:

    Wouldn't here

    try
    {
        Parallel.For(0, n, () => ... throws exceptions);
    }
    catch(exception.Contains(typeof(FooException)) ||
          exception.Contains(typeof(BarException)))
    {
        // handle the exception
    }


    come in handy for C#? Why am I asking this? I see a lot of people starting to handle exceptions like this:

    try
    {
        // some parallel code that throws exceptions.
    }
    catch(AggregateException)
    {
        // do something
    }

    where they would also handle OutOfMemoryException and such - or are such critical exceptions never aggregated?
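    For reference, without any new syntax the pattern being described usually ends up looking roughly like the sketch below (Flatten comes from the released TPL; FooException and BarException are placeholder types). Whether truly critical exceptions such as OutOfMemoryException ever end up aggregated is exactly the open question above, so the sketch simply rethrows anything it does not recognize:

    using System;
    using System.Threading.Tasks;

    // Placeholders for whatever the loop body can actually throw.
    class FooException : Exception { }
    class BarException : Exception { }

    class AggregateHandlingSketch
    {
        static void Run(Action<int> body)
        {
            try
            {
                Parallel.For(0, 100, body);   // several iterations may throw
            }
            catch (AggregateException ae)
            {
                // Inspect every inner exception; anything unrecognized is rethrown
                // rather than silently swallowed.
                foreach (Exception e in ae.Flatten().InnerExceptions)
                {
                    if (e is FooException || e is BarException)
                        Console.Error.WriteLine("handled: " + e.Message);
                    else
                        throw;   // rethrows the original AggregateException
                }
            }
        }
    }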
  • ivan_ g

    to #4

    I understand you can have many apps running on your dev box, but as long as processor core utilization is not 100% you should be able to successfully schedule your tasks on that core. It is similar to a load balancing technique or a time compression utilization. So your parallel code will experience a parallel run-time environment before production; it could be slower, though, so what.

  • Charles Welcome Change
    Ion Todirel wrote:
    I'm using the Silverlight player and there is no sound for this video. What's up, Charles?


    I'm not experiencing this problem. What happens when you use the old player (note that it's the same exact file in each player....)?
    C
  • At 15:50 into the video, JoeDu begins a discussion regarding optimizations with respect to heap vs. stack allocation.

    While I think I have a basic understanding of the tradeoffs between heap and stack allocation, if possible I'd like to hear JoeDu give his perspective.

    Joe, can you comment?

    -AlfredBr
  • evildictaitor,

    I do think Daan overstated the point (perhaps intentionally) about automatic/implicit parallelism.  It is true that many kinds of computations can be automatically run in parallel with little-to-no input from the developer.  When might this be possible?  When a computation is guaranteed not to have side effects or thread affinity.

    This already commonly applies to specialized frameworks and domain-specific languages.  Big hammer APIs like parsing an XML document or compressing a large stream of data also immediately come to mind.  Functional programming as a broader class of automatically parallelizable computations is an interesting one, but is not a silver bullet.  Mostly-functional languages are more popular than purely-functional ones; F# and LISP, for example, permit "silent" side-effects buried within otherwise pure computations, which means you can't really rely on their absence anywhere.

    Haskell and Miranda are two examples from the very small set of purely functional languages, where all "silent" imperative effects are disallowed except through certain type system accommodations (monads), and in which implicit parallelism is possible.  This allows you to at least know when parallelism might be dangerous, and it's the exception rather than the rule.  But even here, many real-world programs are constrained by data and control dependence.  You might be interested in John DeTreville's brief case study on this fact: http://lambda-the-ultimate.org/node/1948.

    Nevertheless, implicit and automatic parallelism are clearly of interest to researchers in the field.  I think what Daan was trying to say is that we're still a few years away from having a more general solution.  Between now and then, however, I would expect to see some specialized frameworks providing this; heck, just look at MATLAB and SQL for examples where this has already succeeded.

    Regards,
    ---joe
  • littleguru,

    Your proposed syntax relies on the 1st pass of SEH, but can be written directly in IL or VB (since they support filters).  C# doesn't support them and, to be honest, I'm glad they don't.  We did consider this model to make AggregateExceptions more palatable, but for various reasons we don't think it would make a huge difference.  Moreover, the 2-pass model of SEH is problematic and so we would prefer not to embellish it.

    I should restate a point from the video: we encourage developers, to the best of their ability, to prevent exceptions from leaking across parallel boundaries.  Life simply remains a lot easier.  Once the crossing is possible, you need to deal with AggregateExceptions, which is a bit like stepping through a wormhole: you end up in a completely different part of the universe with little chance of getting back to your origin.

    The real issue is that with one-level deep examples like the one you show, you can certainly figure out how to pick out the exceptions you care about, handle them, etc.  We even offer the Handle API for this:

    try {
        ... parallel code ...
    } catch (AggregateException ae) {
        ae.Handle(delegate(Exception e) {
            if (e is FooException) {
                ... handle it ...
                return true;
            }
            return false;
        });
    }

    If, after running the delegate on all exceptions, there are any for which the delegate returned 'false' (i.e. unhandled), Handle rethrows a new AggregateException.  I admit, this code is a tad ugly, but even with 1st pass support you'd have to do something like this.  (Unless SEH knew to deliver only the exceptions that were chosen in the 1st pass selection criteria, which would require yet more machinery.)  But the issue is, what if you handle some FooExceptions, but leave some BarExceptions in there?  Again, those up the callstack will see AggregateExceptions and will need to have known to write the whacky code I show above.

    All of this is really to say that AggregateExceptions are fundamentally very different.  Exceptions in current programming languages are, for better or for worse, very single-threaded in nature.  They assume a linear, crawlable callstack, with try/catch blocks that are equipped to handle a single exception at a time.  I can't say I'm terribly happy with where we are, but I can say I think it's the best we can do right now given the current world of SEH.

    ---joe
  • Hi, sorry if this sounds naive, but what if you want to read/parse multiple files from disk in parallel using the TPL? Has anyone done any tests? My guess would be that we need new async I/O features (including async file open) that can be combined with this library to make such a scenario perform.

    As for the discussion of fixed-size problems versus variable-size problems (with varying amounts of data): as an engineering team, you can score by making your software scalable, using more cores for more data.

    In my line of work, there are always customers that have 2-4 times more data than the rest, and the same expectations on performance. If you use the TPL to process your data in parallel, you can tell him to go buy an extra core... and he will love to hear that, because he now has the ultimate excuse to have his boss buy him extra horsepower.
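    As a hedged sketch of the TPL-only version of that scenario (no special async I/O; the folder path and the "parsing" below are placeholders), the loop ends up being I/O-bound, so any gain comes from overlapping disk waits rather than from extra cores:

    using System;
    using System.Collections.Concurrent;
    using System.IO;
    using System.Threading.Tasks;

    class ParallelParse
    {
        static void Main()
        {
            string folder = @"C:\data";              // placeholder input folder
            var lengths = new ConcurrentBag<int>();  // thread-safe result collection

            // Each file is read and "parsed" on a pool thread.
            Parallel.ForEach(Directory.GetFiles(folder, "*.txt"), path =>
            {
                string text = File.ReadAllText(path);
                lengths.Add(text.Length);            // stand-in for real parsing
            });

            Console.WriteLine("Parsed {0} files.", lengths.Count);
        }
    }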

  • Christian Liensberger (littleguru) <3 Seattle
    joedu wrote:
    littleguru,

    ...

    I should restate a point from the video: we encourage developers, to the best of their ability, to prevent exceptions from leaking across parallel boundaries.  Life simply remains a lot easier.  Once the crossing is possible, you need to deal with AggregateExceptions, which is a bit like stepping through a wormhole: you end up in a completely different part of the universe with little chance of getting back to your origin.

    ...

    ---joe


    I very much understand your reasons for the AggregateException. I also think that it is better to catch the exception inside the Task, if possible.

    I only asked the question because there are a lot of people around who don't read the guidelines and break them. That's when it gets complicated and you have to tell people for the 100th time (in forums or wherever) that they should do it otherwise.

    What I think is that exceptions should always walk the stack when not caught explicitly, and with the AggregateException some could get swallowed. But I know that with the current implementation of the CLR, the SEH and C#, you can't come up with something that is much better.

    Btw. I like the Handle delegate because at least with one method call you make sure that the unhandled exception(s) get thrown again. Still, it is a method that somebody needs to invoke, and not automatic, but better than nothing Smiley

    I like the PFX framework a lot! It's really an awesome work that you people did with it.
  • Here is my suggestion for a subject for part 2.

    PLINQ, Parallel.For and so on are good for computers with lots of CPUs, but what would also be good is if we could use the same language to access the many processors on a graphics card.

    At the moment there are special shader languages such as NVIDIA's Cg language to write shaders.

    Can PLINQ take advantage of or encapsulate the parallelism on graphics cards, or even multiple channels on sound cards? I think this would make writing shaders very easy, and they could be implemented either on multiple cores or on a graphics card.

    So could you ask about that, and also whether you are concentrating only on multiple CPUs, because this may do away with the need for a dedicated graphics card in the long term? Big Smile

    When you are designing PLINQ and so on, do you mainly have database and data programming and business software in mind, or graphics and games too?

  • Nice video. For part 2, could you show us a similar example of shared state with bad side effects when using PLINQ, but something a little more involved than x++?

    And a shameless plug - you can get a transcript and an index for this long screencast at my blog post here.

  • William Stacey (staceyw) Before C# there was darkness...
    evildictaitor wrote:
    
    MetaGunny wrote:
    Charles,

    You brought up a great point/question that has been on my mind for a while.

    Why not add some keyword or attribute to the language itself to further enhance parallelism?

    I can't remember his name but he replied that they are looking into this level of language integration.  I'd love to see a video on that topic.



    Noooo! Leave C# alone! One of the reasons C# is nice and easy to learn is that its core keyword set is quite compact. If you start adding lots of keywords here, there and everywhere, the language gets out of hand quickly.

    On the other hand, there's no reason why a compiler shouldn't be able to take the same keyword "for" to mean both the normal synchronous meaning when the for-body is expensive, and the Parallel.For() when the body can be easily split into asynchronous tasks, and have this as an option inside the build menu.


    Funny you say that.  During that part, I was thinking it would be cool to be able to add keywords manually using some kind of extension method deal.  That way, people could experiment back and forth on syntax ideas or just use them in their namespaces.  New keywords or overrides (with supporting libraries) could be a plug-in model for VS.  That way, C# proper stays clean, but could be extended and experimented with.
  • FOR PART 2:

    1) Grid computing has the same challenges as multi-core computing, adding in extra overhead and remoting/serialization requirements. Windows HPC Server 2008 will provide a job manager that can distribute jobs over a network with WCF. Digipede Networks offers a similar solution. What is the feasibility of doing the same thing with the TPL and what potential hurdles must be overcome?

    2) What about TPL + Silverlight 2.0?

    3) What about Volta + Silverlight + TPL for #1?



  • Christian Liensberger (littleguru) <3 Seattle
    staceyw wrote:
    Funny you say that.  During that part, I was thinking it would be cool to be able to add keywords manually using some kind of extension method deal.  That way, people could experiment back and forth on syntax ideas or just use them in their namespaces.  New keywords or overrides (with supporting libraries) could be a plug-in model for VS.  That way, C# proper stays clean, but could be extended and experimented with.


    You probably need to implement some preprocessor that does such stuff for you... but wait, wouldn't that mean we are back in the age of macros. Oh noes!

    As for part 2: Please go into more details on F# and the PFX. Where and how do they collaborate? Smiley
  • I would also like to know about that. Could future implementations of the TPL be used, for example, to make a distributed project like the one people downloaded to model climate change? That is an example of parallelism in which the CPUs are distributed on computers worldwide. Then the program could be run either on a single computer with multiple cores, or distributed on multiple computers in a local grid, or distributed to different computers on the internet. What are your thoughts on this?

    This is the way I see computers going, where it doesn't matter where the CPUs are; as long as they are joined together in some way, you can run parallel programs on them.

  • evildictaitor Devil's advocate
    staceyw wrote:
    
    Funny you say that.  During that part, I was thinking it would be cool to be able to add keywords manually using some kind of extension method deal.  That way, people could experiment back and forth on syntax ideas or just use them in their namespaces.  New keywords or overrides (with supporting libraries) could be a plug-in model for VS.  That way, C# proper stays clean, but could be extended and experimented with.


    Although I understand where you're coming from, this idea would lead to source code becoming "locked" to a machine and IDE, which would make life very difficult when giving your source to someone else (say a colleague) or collaborating in a team, unless very strict permissions on who can add new keywords were introduced.

    All in all, it sounds like the type of thing that would only be fully understood by, or helpful to, language developers and a small number of hobbyists, but could significantly damage both the reputation and workability of C# in general, when those same hobbyists and language developers could just as easily use a function.

    All in all I think that would be a bad move for C#.
  • DaveNoderer

    Starting to fool around with the TPL; I did a small program to investigate I/O. Somewhat real, in that I'm working on a purge of old files (.eml's) for a customer.

    I'm copying ~ 5000 files to a directory then deleting them. Not very interesting but close to what the customer needs done.

    I found that with the sequential loop (there are other ways in System.IO to do this besides a loop!) it took ~11 seconds and the parallel loop took ~6 seconds on my dual-core laptop.

    I don't have a 4-core machine; I suspect that adding a third processor would not help much. I'm assuming that the first thread blocks for I/O, then the second thread can queue up another I/O, and back and forth, but having 3 or 4 processors would not necessarily do any better.

    The copy takes most of the time.

    For Each FI In DirSrc.GetFiles
      FI.CopyTo(Path.Combine(Me.FilePath, FI.Name))
    Next

    vs.

    Parallel.ForEach(DirSrc.GetFiles, Function(Fix) Fix.CopyTo(Path.Combine(Me.FilePath, Fix.Name)))

    Which brings up the subject of good tools to see what is really happening...

  • Great vid. I would like to see, if possible, a demo of how the scheduling from the library works along with the OS scheduling in Windows. Is it possible to override whatever the OS does with each processor?
  • William Stacey (staceyw) Before C# there was darkness...
    evildictaitor wrote:
    
    staceyw wrote:
    
    Funny you say that.  During that part, I was thinking it would be cool to be able to add keywords manually using some kind of extension method deal.  That way, people could experiment back and forth on syntax ideas or just use them in their namespaces.  New keywords or overrides (with supporting libraries) could be a plug-in model for VS.  That way, C# proper stays clean, but could be extended and experimented with.


    Although I understand where you're coming from, this idea would lead to source code becoming "locked" to a machine and IDE, which would make life very difficult when giving your source to someone else (say a colleague) or collaborating in a team, unless very strict permissions on who can add new keywords were introduced.

    All in all, it sounds like the type of thing that would only be fully understood by, or helpful to, language developers and a small number of hobbyists, but could significantly damage both the reputation and workability of C# in general, when those same hobbyists and language developers could just as easily use a function.

    All in all I think that would be a bad move for C#.


    Yeah.  The IDE thing would not be good.  However, the extension method approach would still work, as the code is included, so possibly it would all be handled at the compiler level.  Kind of a keyword-by-attribute/code deal.
  • Christian Liensberger (littleguru) <3 Seattle
    staceyw wrote:
    Yeah.  The IDE thing would not be good.  However, the extension method approach would still work, as the code is included, so possibly it would all be handled at the compiler level.  Kind of a keyword-by-attribute/code deal.


    What are the benefits of adding this? Why not just use a method call instead? Wouldn't this get complicated (speaking of nesting custom keywords)? Also, how do you think the syntax should look, exactly?
  • Nice talk... it's nice to see the amount of effort being put into it.
    The next one, as said, is with Don Syme...

    Just can't wait for that... (Don is great.)
    F# is very cool and is my favourite language along with C#.

    My question for Don Syme is when we'll have WinForms/WPF designer support for F# in Visual Studio.
    Also I wouldn't mind a lot of talk about F# in the next video...

    Why functional languages are important...
    How F# can solve the problems of multi-core/parallelism; talk about F# async workflows, asynchronous I/O...
    Any upcoming new feature details... plus when F# will get productized, full details, or what the target is...

    A lot of questions about F# and how much Microsoft will push it into the mainstream.
    I think of it as a very immense but still very much general-purpose language, alongside its heavy use in computational work, financial analysis, etc.

    Also I would recommend others read the fantastic book "Expert F#":
    http://www.amazon.com/Expert-F-Experts-Voice-Net/dp/1590598504
  • William Stacey (staceyw) Before C# there was darkness...
    littleguru wrote:
    
    staceyw wrote:
    Yeah.  The IDE thing would not be good.  However, the extension method approach would still work, as the code is included, so possibly it would all be handled at the compiler level.  Kind of a keyword-by-attribute/code deal.


    What are the benefits of adding this? Why not just use a method call instead? Wouldn't this get complicated (speaking of nesting custom keywords)? Also, how do you think the syntax should look, exactly?


    You would have real language extensions, for one thing.  PowerShell does this with functions.  You could define language blocks (i.e. for(){}, lock{}) that you can't do with methods.

    Something like:

    public static function MyFor(int start, func<int, bool> test, func<int, int> last)
    {
        // do either Parellel for or normal for using a global switch.
    }

  • For those who like language extensions: you may like an extensible language instead:

    http://boo.codehaus.org/Syntactic+Macros
  • Hey guys, great show. Thanks.

    One suggestion though: the Fibonacci function is not really the best example for parallel computation. Fibonacci is a typical sequential function: F(n) depends on all F(m) with m < n, and the fastest way to compute the function is to start at F(0) and then calculate F(1), F(2), etc. This can be done in O(n).

    The "parallel" example that was given in the show is O(n^2).

    A parallel implementation will always be slower than a smart sequential implementation. (This is a challenge Wink)

    I will start using the TPL on an HPC academic project and will share my findings if applicable.

    Kind regards,
    Martijn Kaag

    http://www.kaag.eu
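    For concreteness, the linear-time sequential version being referred to is just a simple loop (a sketch, using the usual convention Fib(0) = 0, Fib(1) = 1); the doubly-recursive PFib from the video is best read as a fork/join illustration rather than a fast way to compute Fibonacci:

    // Linear-time sequential Fibonacci: each value is computed once from the
    // previous two.
    static long Fib(int n)
    {
        long a = 0, b = 1;
        for (int i = 0; i < n; i++)
        {
            long next = a + b;
            a = b;
            b = next;
        }
        return a;
    }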
  • Did anyone try to benchmark these TPL examples?
    I've tried the PFib function from the video and its parallel version is several times slower than its sequential sibling on a dual-core XP machine with the December CTP of the TPL. The console app created 29 threads and used a huge chunk of memory.

    Also, Parallel.For was 1.5-2 times slower on simple bulky array operations.

    Do I need at least a quad core to see advantages?
  • Well, at least one parallel test was indeed faster, by a factor of 1.5 (on matrix sizes bigger than 2000x2000):

            private static double[,] MultP(int w, int h)
            {
                double[,] m1 = new double[w, h];
                Parallel.For(0, w, x =>
                {
                    Parallel.For(0, h, y =>
                    {
                        m1[x, y] = x * y * 1000.0;
                    });
                });
                Parallel.For(0, w, x =>
                {
                    Parallel.For(0, h, y =>
                    {
                        m1[x, y] = Math.Sqrt(m1[x, y]);
                    });
                });
                return m1;
            }


    Maybe nested Parallel.For loops are easier to parallelize. Similar vector tests are 1.5-2 times slower.

    Something is completely wrong with the PFib case. TPL is 90 times slower.
    Memory consumption is an issue with all parallel tests.
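    One common tweak (a general TPL pattern, not something from the video) is to parallelize only the outer loop and fuse the two passes, so each task gets a whole row of work instead of a single element; whether it actually closes the gap is something only a measurement on the same machine can confirm. A sketch of that variant:

        private static double[,] MultOuterOnly(int w, int h)
        {
            double[,] m1 = new double[w, h];
            // Only the outer loop is parallel; each task fills a whole row, and
            // the two passes from the original test are fused into one.
            Parallel.For(0, w, x =>
            {
                for (int y = 0; y < h; y++)
                {
                    m1[x, y] = Math.Sqrt(x * y * 1000.0);
                }
            });
            return m1;
        }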

  • boomeranda wrote:
    Did anyone try to benchmark these TPL examples?
    I've tried the PFib function from the video and its parallel version is several times slower than its sequential sibling on a dual-core XP machine with the December CTP of the TPL. The console app created 29 threads and used a huge chunk of memory. <= it's not because of PFib

    Also, Parallel.For was 1.5-2 times slower on simple bulky array operations.

    Do I need at least a quad core to see advantages?


    Obviously the overhead from switching threads is too high for such a simple task. After adding more demanding calculations to PFib, TPL's advantage becomes obvious.

    Good job TPL team!
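    One detail worth adding (a standard fork/join pattern rather than anything specific to the CTP): the toy PFib spawns a task for every recursive call, and each call does almost no work, so task overhead dominates. A sequential cutoff usually fixes that. A hedged sketch in .NET 4 syntax (the December CTP's task API was spelled differently, and the threshold below is an arbitrary guess that needs tuning):

    using System.Threading.Tasks;

    static class FibWithCutoff
    {
        // Below this size the plain recursive version is used, so task-creation
        // overhead is only paid near the top of the call tree.
        const int SequentialThreshold = 20;

        public static long PFib(int n)
        {
            if (n < SequentialThreshold)
                return SeqFib(n);

            // Fork one half of the work; compute the other half on this thread.
            Task<long> left = Task.Factory.StartNew(() => PFib(n - 1));
            long right = PFib(n - 2);
            return left.Result + right;
        }

        static long SeqFib(int n)
        {
            return n < 2 ? n : SeqFib(n - 1) + SeqFib(n - 2);
        }
    }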
  • How much parallelism and optimization can be done by the compiler automatically?
    Daan points out that even with code with no side effects, the question of the overhead remains.
    I think this can be solved by an approach like PGO (Profile-Guided Optimization). In other words, run the program with some instrumentation to try out the different parameters and see what benefits they bring, then re-compile the program with those decisions and optimizations built in.
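    As a very rough runtime analogue of that idea (the helper below is invented for illustration; a single timing run is far too noisy for real profile-guided decisions, and the body must be safe to run twice), one could imagine measuring both forms once and caching the winner:

    using System;
    using System.Diagnostics;
    using System.Threading.Tasks;

    static class LoopChooser
    {
        // Times the sequential and parallel forms of the same loop body on
        // representative data and reports which one was faster.
        public static bool ParallelWins(int n, Action<int> body)
        {
            var sw = Stopwatch.StartNew();
            for (int i = 0; i < n; i++) body(i);
            long sequentialMs = sw.ElapsedMilliseconds;

            sw.Restart();
            Parallel.For(0, n, body);
            long parallelMs = sw.ElapsedMilliseconds;

            return parallelMs < sequentialMs;
        }
    }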
