Posted By: yman | Aug 19th, 2007 @ 9:50 AM
page 1 of 1
Comments: 16 | Views: 7930
First off, I must admit I am not a maths guy, so please be gentle if I am talking a load of rubbish Smiley

I am working on a new community social networking site (not another one you say? Don't worry, it's quite original and all will be revealed in time.) which allows users to rate each piece of content in x series of categories with a value from 1 to 10.

I need a way to accurately work-out the overall average value for each category. However, I am aware (thanks GCSE coursework) that the arithmetic average (add up all sums and divide by the frequency of numbers) can be "poisoned" or affected with outliers, extreme values.
I need a way to accurately determine the overall average but also taking into account each users "voting weight". Say one user had a voting rate percentage of 100%, their rate would take more weight than someone with a voting weight percentage of 10%. I would like the voting rate to be determined from the overall average for a category when calculated. Say I rated the content in the category with a value of 1 and when calculated the overall average of all votes is 9, I would like my voting percentage to decrease. This is vulnerable to people collaborating and systematically increasing their voting weight, however I am sure I can do some checking to detect such behavior.

I first thought of removing extreme values by arranging all values in ascending order and subtracting one from each other thus allowing me to get a list of the increasing amount. I imagined I could then remove extreme increasing amounts therefore meaning I could take extreme answers away. But after that I seemed to go around in circles trying different methods.

Anyone have any ideas how to calculate this?
I have written up a series of rules that have help me come to my conclution of how this works.

  • User has a voting weight
  • Content only affects the users weight after x number of votes have been cast on it.
  • There is an extremity factor in your ratings, for every point off of the rounded rating of the content, your ratings extremity is that. (If the avg was 7.5 and I voted 8 I'm fine, if i voted 10 that is an extremity of 2)
  • Your votes count 1 * your weight, so if i have a weight of 150, each vote counts 150 times.
  • The Content weight will be the total of all users' weights who have voted on the article at the time of the vote.
  • The number of votes and number of voters are different, becuase of the weight.  But this will give a better average based on weight.
Here is the formula for how it affects the user:

user.weight = (content.weight) / (allusers.totalweight - (Extremity^3))

EDIT:

I was informed that this wouldn't work as I said it, because the Content doesnt fall under a single category, but could be many.

So a fix using this same thing is a user would vote at each category, and then the articles weight would be the article's category weight.
littleguru
littleguru
<3 Seattle
yman wrote:
W3bbo has suggested to simply use the arithmetic mean due to the fact that the values can only be between 1 and 10, thus meaning there won't be any extreme outliers. He also suggested that I should look into Standard Deviation to determine each user's weight.


I would also go with that... Don't make it complicater than it is... Smiley
Dr Herbie
Dr Herbie
Horses for courses
yman wrote:
W3bbo has suggested to simply use the arithmetic mean due to the fact that the values can only be between 1 and 10, thus meaning there won't be any extreme outliers. He also suggested that I should look into Standard Deviation to determine each user's weight.


Outliers: Surely that's realtive and dependent on the number of people voting:  if 9 people vote 1 and 1 person votes 10 then the average is almost doubled from 1 to 1.9. If four people vote 1 and 1 person votes 10 then the mean moves from 1 to 2.8.
You could remove outliers by frequency : find the frequency of each possible score and ignore scores with a frequency of less than some arbitrary filter (like 5%). Unfortunately, this only works when there are sufficient votes (like more that 10 votes), and can cause problems if the votes are evenly spread and your filter level is too high (could potentialy remove all votes!).


Standard deviation:  Only to be used on parametric data.  In my above examples, the data is non-parametric. Your range of possible values is smal (1 - 10) so testing if the data is parametric is tricky. On the other hand, I have no idea if this type of data is likely to follow a Poisson distribution or not; it depends on the type of content (some content polarises views).
As your range is smal 1 - 10, I would stick to absolute values rather than standard deviation (like 'is within 2 of average', 'is within 4 of average', etc)

It's a tricky problem, but I think there's a case to be made for a 'suck it and see' attitude.  Try a set of rules, but make them easy to tweak if they don't work well.  Perhaps a lot of test scenarios would help.


Herbie
W3bbo
W3bbo
The Master of Baiters
yman wrote:
Thanks Dr Herbie, though I had to read your post twice to understand:P I'll keep your ideas in mind.


Point is: Your data on the scores of stuff won't follow the normal distribution. Despite that on a scale of 1-10, "5" should be the average number, human beings don't work like that. If it's average, it isn't voted on. People only vote if it's awful (1 star) or great (10 stars). Hence the reviews will be polarised and it wo'nt follow the normal distribution, making the standard deviation measure irrelevant.
Run the Student's T test on your sample set to detect the outliers and then calculate the average on the remaining values.
Dr Herbie
Dr Herbie
Horses for courses
yman wrote:
Thanks Dr Herbie, though I had to read your post twice to understand:P I'll keep your ideas in mind.


Sorry for the complexity of my post -- I'm not the best at explaining these things. Thanks to W3bbo for clarifying.


<Self_Promotion_Warning>

My blog entries on normal distributions might make the picture clearer (if you can ignore all the references to evolution) or they might make things less clear ....

</Self_Promotion_Warning>


I would be careful about over-engineering this as there may be little noticeable difference to the end user for a large amount of effort from the developers.  Unless you're enjoying it, of course Wink


Herbie
littleguru
littleguru
<3 Seattle
Isn't this all very complex for such a simple task? I mean, I might be wrong, but it seems to me that this is just a simple voting system, where you calculate the average ((i1 + i2 + i3 + ... + in)/n) and that's it. I mean you could calculate also the variance and do some reasoning over it, but isn't that not to much an overkill?
evildictaitor
evildictaitor
if( !succeed( try() ) ) { while(true) try(); }
littleguru wrote:
Isn't this all very complex for such a simple task? I mean, I might be wrong, but it seems to me that this is just a simple voting system, where you calculate the average ((i1 + i2 + i3 + ... + in)/n) and that's it. I mean you could calculate also the variance and do some reasoning over it, but isn't that not to much an overkill?


Noes! Don't recompute the sum, because that takes time and space to save.


public class RunningMean
        {
            private double N = 0;
            private double CurrentMean = 0;
            public double Mean
            {
                get
                {
                    return CurrentMean;
                }
            }
            public void Insert(double item)
            {
                CurrentMean = ((N / (N + 1)) * CurrentMean) + (item / (N + 1));
                N++;
            }
        }
Matthew van Eerde
Matthew van Eerde
AKA Maurits
I note that MSDN displays a histogram for rated articles instead of any kind of an average.
Dr Herbie
Dr Herbie
Horses for courses
evildictaitor wrote:

littleguru wrote: Isn't this all very complex for such a simple task? I mean, I might be wrong, but it seems to me that this is just a simple voting system, where you calculate the average ((i1 + i2 + i3 + ... + in)/n) and that's it. I mean you could calculate also the variance and do some reasoning over it, but isn't that not to much an overkill?


Noes! Don't recompute the sum, because that takes time and space to save.


public class RunningMean
        {
            private double N = 0;
            private double CurrentMean = 0;
            public double Mean
            {
                get
                {
                    return CurrentMean;
                }
            }
            public void Insert(double item)
            {
                CurrentMean = ((N / (N + 1)) * CurrentMean) + (item / (N + 1));
                N++;
            }
        }


<More self promotion>

In my sandbox entries, there's a couple of C# classes for basic statistics (and histograms). There's some unit testing code for them too somehwere in there.  Feel free to rip them off. Smiley

</More self promotion>


Herbie

Matthew van Eerde
Matthew van Eerde
AKA Maurits
evildictaitor wrote:

            private double N = 0;
            private double CurrentMean = 0;
            public double Mean
            {
                get
                {
                    return CurrentMean;
                }
            }
            public void Insert(double item)
            {
                CurrentMean = ((N / (N + 1)) * CurrentMean) + (item / (N + 1));
                N++;
            }


Could be simplified at the cost of some orthogonality...

            private double N = 0;
            private double CurrentTotal = 0;

            // koan: what is the mean of the empty set?
            private double CurrentMean = 0;

            public double Mean
            {
                get
                {
                    return CurrentMean;
                }
            }
            public void Insert(double item)
            {
                N++;
                CurrentTotal += item;
                CurrentMean = CurrentTotal / N; // N is > 0 by ++ above
            }

evildictaitor
evildictaitor
if( !succeed( try() ) ) { while(true) try(); }
Matthew van Eerde wrote:

Could be simplified at the cost of some orthogonality...


When the total sum gets very big your one overflows - mine never overflows unless the mean over/underflows, you insert infinity or NaN or N overflows Tongue Out
evildictaitor
evildictaitor
if( !succeed( try() ) ) { while(true) try(); }
Matthew van Eerde wrote:


            // koan: what is the mean of the empty set?



Ooh. Good question. There isn't one. When you define a set you must define the domain of the elements (which can just be the set if you want, but that isn't very useful), so for example the elements 1,3,5 can be defined as the integer set {1,3,5} or the real set {1.0,3.0,5.0} or the complex set {1+0i,3+0i,5+0i} etc.

Any internal operation on the set must be strictly a member of the set, so the mean of int{1,3,5} is 2, of {1.0,3.0,5.0} is 2.666... and of {1+0i,3+0i,5+0i} is 2.66....+ 0i.

The only exceptions are the length of a set (which is a set operator) which is defined as a Natural number and the subset of a set, which is a set over the same domain (or a subset of the same domain).

The problem with the empty set is that you can define it's domain-independant. This means that not only is the mean undefined, it's domain is unconstrained, so it might not even be a number.
page 1 of 1
Comments: 16 | Views: 7930
Microsoft Communities