Data analysis of the PL

Aqua-
Simon Grayson's Hairdresser
Posts: 547
Joined: 20 Apr 2014, 22:13

Data analysis of the PL

Post by Aqua- »

Hi,

I am studying physics at uni, and much of the work is about planning, gathering, and analyzing data.
With the high expectations on the team in our first PL season in 16 years, I decided to download some data from the internet and analyze it, to get some quantitative info about how well we are progressing each week and how many points are usually enough to stay in the league, get a Europa League spot, a Champions League spot etc...
The data I used was taken from https://www.football-data.co.uk/englandm.php .
Each file contains the match list of a specific season, from the 93/94 season up to last season (a csv file for the current season is kept updated as well!).
The stats in each file differ, with later seasons featuring more stats. Most if not all contain FT and HT results; other stats documented are ref name, attendance, date, corners, free kicks, shots, shots on target etc...
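
As a rough illustration of how a season file like this could be turned into a final table, here is a minimal sketch. It assumes the columns are named `HomeTeam`, `AwayTeam`, `FTHG`, `FTAG` and `FTR` (with `FTR` being `H`, `D` or `A`); check the notes on the site for the exact names in each season. The sample rows, the team names and the function `league_table` are made up for the example and are not the code used for the stats in this thread. Ties are broken by goal difference only, and points deductions are ignored.

```python
import csv
import io
from collections import defaultdict

# Tiny made-up sample in the same shape as a season file (not real results).
SAMPLE_CSV = """Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR
01/01/2000,Alpha,Beta,2,0,H
02/01/2000,Beta,Gamma,1,1,D
03/01/2000,Gamma,Alpha,0,3,A
04/01/2000,Beta,Alpha,1,0,H
"""


def league_table(csv_text):
    """Return [(team, points, goal_difference)] sorted from 1st place down."""
    points = defaultdict(int)
    goal_diff = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        home, away = row["HomeTeam"], row["AwayTeam"]
        home_goals, away_goals = int(row["FTHG"]), int(row["FTAG"])
        goal_diff[home] += home_goals - away_goals
        goal_diff[away] += away_goals - home_goals
        # Touch both teams so a club with zero points still gets a row.
        points[home] += 0
        points[away] += 0
        if row["FTR"] == "H":
            points[home] += 3
        elif row["FTR"] == "A":
            points[away] += 3
        else:
            points[home] += 1
            points[away] += 1
    # Simplified ordering: points first, then goal difference.
    return sorted(
        ((team, points[team], goal_diff[team]) for team in points),
        key=lambda row: (row[1], row[2]),
        reverse=True,
    )


table = league_table(SAMPLE_CSV)
for position, (team, pts, gd) in enumerate(table, start=1):
    print(position, team, pts, gd)
```

For a real file, the text could be read from disk and passed to `league_table` in place of `SAMPLE_CSV`.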

I analyze the data using python code, and I save every piece of data as a csv (which opens in Excel), so if anyone is interested in seeing the code and/or getting their hands on the files, just reply in this thread or PM me, I'd love to share it.
I've started with a very basic analysis here, but I hope to keep working on it and find some more interesting stats.
If anyone has an idea for something to check, tell me about it and I'll add it to my to do list :D.

One last thing - statistics are not always right; they are rather a means of describing a standard. They do not claim to predict the future. If there is one thing that can be said about this particular season, it is that it isn't standard at all! With no crowds at the stadiums, a denser schedule, a prolonged transfer window, and possible corona infections of individual players, staff members and their relatives, it will be interesting to see how this season differs in numbers.

So without further ado I shall present my results:

* In order to finish in 17th place, which guarantees another season in the first tier, in an ordinary season a club would have to secure 38 points on average, with a standard deviation of 2.47 points, where the maximum upper deviation is 6 points (i.e. 44 pts) and the maximum lower deviation is 4 points (i.e. 34 pts). The median for this position is also 38 points.
- A mean of 8.96 +- 2.32 pts should be gained against teams that will be relegated (out of 18 possible)
- A mean of 6.84 +- 3.20 pts should be gained against teams that will finish in lower midtable (places 14-17) (out of 18 possible, since we finish 17th)
- A mean of 13.2 +- 3.24 pts should be gained against teams that will finish in upper midtable (places 8-13) (out of 36 possible)
- A mean of 5.28 +- 2.32 pts should be gained against teams that will qualify for the Europa League (places 5-7) (out of 18 possible)
- A mean of 3.04 +- 2.20 pts should be gained against teams that will qualify for the Champions League (places 2-4) (out of 18 possible)
-** A mean of 0.8 +- 1.26 pts should be gained against the team that will win the league (1st place) (out of 6 possible)
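
The split by opponent group in the list above could be computed along these lines. This is only a sketch under assumptions: matches are given as `(home, away, result)` tuples with `result` being `H`, `D` or `A`, final positions are given as a dict, and the names `GROUPS`, `group_of`, `points_vs_groups` and `max_points_available` are hypothetical, not taken from the code behind these stats. The group boundaries are the ones used in this post.

```python
# Position groups as used in the post (1-based final league positions).
GROUPS = {
    "champions": range(1, 2),
    "champions_league": range(2, 5),
    "europa_league": range(5, 8),
    "upper_midtable": range(8, 14),
    "lower_midtable": range(14, 18),
    "relegated": range(18, 21),
}


def group_of(position):
    """Name of the group that a final league position belongs to."""
    for name, positions in GROUPS.items():
        if position in positions:
            return name
    raise ValueError(f"position {position} is outside 1-20")


def points_vs_groups(team, matches, final_position):
    """Split one team's points by the final-position group of each opponent.

    matches: iterable of (home, away, result), result being "H", "D" or "A".
    final_position: dict mapping team name -> final league position.
    """
    totals = {name: 0 for name in GROUPS}
    for home, away, result in matches:
        if team == home:
            opponent, winning_result = away, "H"
        elif team == away:
            opponent, winning_result = home, "A"
        else:
            continue  # match does not involve this team
        gained = 3 if result == winning_result else 1 if result == "D" else 0
        totals[group_of(final_position[opponent])] += gained
    return totals


def max_points_available(own_position):
    """Home and away against every other club in a group: 6 points each."""
    return {
        name: 6 * sum(1 for p in positions if p != own_position)
        for name, positions in GROUPS.items()
    }


# Made-up mini example: "Us" finishing 17th.
positions = {"Us": 17, "Top": 1, "Mid": 10, "Down": 19}
matches = [("Us", "Top", "D"), ("Down", "Us", "A"), ("Mid", "Us", "H")]
print(points_vs_groups("Us", matches, positions))
print(max_points_available(17))
```

`max_points_available` reproduces the "out of N possible" figures: a club finishing 17th has only three other clubs in places 14-17, hence 18 points rather than 24.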


* In order to finish in a top half position (10th or higher) we will need to secure an average of at least 49 pts, with a standard deviation of 2.66 pts, where both the max upper and lower deviations are 5 pts (i.e. 54 & 44 respectively). The median for 10th place is 50 pts.
- A mean of at least 11.24 +- 2.87 pts should be gained against teams that will be relegated (out of 18 possible)
- A mean of 13.4 +- 3.12 pts should be gained against teams that will finish in lower midtable (places 14-17) (out of 24 possible)
- A mean of 13.76 +- 3.33 pts should be gained against teams that will finish in upper midtable (places 8-13) (out of 30 possible, since we finish 10th)
- A mean of 6.08 +- 2.15 pts should be gained against teams that will qualify for the Europa League (places 5-7) (out of 18 possible)
- A mean of 3.96 +- 2.57 pts should be gained against teams that will qualify for the Champions League (places 2-4) (out of 18 possible)
-** A mean of 0.8 +- 0.94 pts should be gained against the team that will win the league (1st place) (out of 6 possible)

* In order to sneak into a European qualification spot (assuming one cup will be won by a top 5 club, hence 6th place will lead to Europe) we will need to secure an average of at least 61 pts, with a higher standard deviation of 3.75 pts, where both the max upper and lower deviations are 8 pts (i.e. 69 & 53 respectively). The median for 6th is 61 pts.
- A mean of 12.24 +- 3.13 pts should be gained against teams that will be relegated (out of 18 possible)
- A mean of 14.4 +- 3.26 pts should be gained against teams that will finish in lower midtable (places 14-17) (out of 24 possible)
- A mean of 20.76 +- 2.87 pts should be gained against teams that will finish in upper midtable (places 8-13) (out of 36 possible)
- A mean of 5.68 +- 2.54 pts should be gained against teams that will qualify for the Europa League (places 5-7) (out of 12 possible, since we finish 6th)
- A mean of 6.68 +- 2.59 pts should be gained against teams that will qualify for the Champions League (places 2-4) (out of 18 possible)
-** A mean of 1.44 +- 1.36 pts should be gained against the team that will win the league (1st place) (out of 6 possible)

* For the cheeky ones out there - to win the league we will need a high average of 87 pts, with a high standard deviation of 6.5 pts, where the max upper deviation is 13 pts (Man City with 100 pts) and the max lower deviation is 12 pts (scum with 75 pts). The median is the same as the average - 87 pts.
- A mean of 16.28 +- 1.40 pts should be gained against teams that will be relegated (out of 18 possible)
- A mean of 19.32 +- 3.32 pts should be gained against teams that will finish in lower midtable (places 14-17) (out of 24 possible)
- A mean of 27.84 +- 3.55 pts should be gained against teams that will finish in upper midtable (places 8-13) (out of 36 possible)
- A mean of 12.76 +- 3.22 pts should be gained against teams that will qualify for the Europa League (places 5-7) (out of 18 possible)
- A mean of 11.2 +- 2.97 pts should be gained against teams that will qualify for the Champions League (places 2-4) (out of 18 possible)
-** No need to take any points as we are the champions!

** The last stat (pts against champions) seems to give nonsensical values, with a standard deviation greater than the value itself. As I mentioned before, this is because statistics do not do well with little data. When it comes down to only 2 games per season there are too many variables to take into account (injuries, momentum, whether the champions have already secured the title etc...), so the statistics usually don't tell us much. I am considering merging the champions into the CL-qualified group, to make it places 1-4. I will be happy to hear your opinion.

Enjoy and stay tuned. :D

edit:
I realized some might not be familiar with the exact meaning of the statistical terms, so I'll explain briefly:
mean value - the same as the average in all my analysis. The mean doesn't have to be a value within the population or even a possible result.
standard deviation - a measure of how spread out the values are around the mean. For normally distributed data, about 68% of the values fall within one standard deviation of the mean (34.1% on each side). In practice, if we try to guess, say, how many points 17th place is going to have this (or any other) season, and the points are roughly normally distributed, then in about 68% of seasons the value would be within 38 +- 2.47 pts.
median - the middle number in a sorted set of values. Say we have the set {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}: the median would be (5+6)/2 = 11/2 = 5.5. The median also doesn't have to be a value in the population or even a possible value (with whole-number points it can land on a half, e.g. 37.5).
maximum upper/lower deviation - the maximum deviation from the mean value to either side of the mean - upper -> higher value than the mean, lower -> lower value than the mean.
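
These four quantities can be computed with Python's standard `statistics` module. A minimal sketch follows; the points list is made up for illustration (it is not the real 17th-place data), and `stdev` is the sample standard deviation, which may or may not be the variant used for the numbers in this thread.

```python
import statistics

# Made-up points totals for one league position over several seasons.
points = [34, 36, 37, 37, 38, 38, 39, 39, 44]

mean = statistics.mean(points)
std = statistics.stdev(points)        # sample standard deviation
median = statistics.median(points)
max_upper = max(points) - mean        # largest deviation above the mean
max_lower = mean - min(points)        # largest deviation below the mean

# Share of seasons that fall within one standard deviation of the mean.
# For normally distributed data this tends towards about 68%.
within_one_std = sum(abs(p - mean) <= std for p in points) / len(points)

print(f"mean={mean} std={std:.2f} median={median}")
print(f"max upper deviation={max_upper} max lower deviation={max_lower}")
print(f"share within one std={within_one_std:.2f}")

# The median of the set {1, ..., 10} from the explanation above.
print(statistics.median(range(1, 11)))
```

With only a handful of seasons the share within one standard deviation can sit well away from 68%, which is the same small-sample caveat as for the points against the champions.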

2nd edit:
For a lot of the stats I present, I dropped the 93/94 & 94/95 seasons as they feature 22 clubs, so things like the absolute number of points are bound to deviate greatly from a 20 club season.
Last edited by Aqua- on 03 Oct 2020, 10:30, edited 3 times in total.
kk white
Raich Carter's Contract Agent
Posts: 3594
Joined: 12 Aug 2009, 14:23
Location: Galway

Re: Data analysis of the PL

Post by kk white »

Love this stuff Aqua. Cheers.

Surprised how close in points safety (17th, c 38 points) and Top 10 (c 49 points) are.

Also love your note after the Champions stats: "-** No need to take any points as we are the champions!" :D :thumbup:
"An astonishing number of people despise Leeds United or what Leeds United stand for. But this club was never made for them." - Phil Hay
johnh
Bielsa's English Teacher
Posts: 8522
Joined: 24 Jan 2012, 15:26

Re: Data analysis of the PL

Post by johnh »

Yes, I used to love this stuff too but these days my brain hurts. I said 'au revoir' to Monsieur Poisson a long time ago.
I once played against Don Revie.
Sovietmule
Paul Heckingbottom's career advisor
Posts: 155
Joined: 16 May 2019, 17:01

Re: Data analysis of the PL

Post by Sovietmule »

I love a bit of data.

Don't know if you saw the animation of Player of the Season http://www.lufctalk.com/forums/viewtopic.php?f=8&t=6104 I did for last season; might be something to consider to bring the data to life?

Good luck.

p.s. I gave up on python a while back as I was trying to get into it running ubuntu via crouton on a chromebook and it just seemed to be a giant pain. Anyway, I've got some php/codeigniter so if you need any help in that direction just give me a shout ...
Davycc
LUFCTALK Moderator
Posts: 15076
Joined: 03 Aug 2011, 18:09
Location: Location Location

Re: Data analysis of the PL

Post by Davycc »

I've half a bottle of Rioja downed ... I may look at this tomorrow :thumbup:
All at Amazon Books

The Funny Corner
When Santa Got Stuck Up The Chimney
The Thrones Murders
Aqua-
Simon Grayson's Hairdresser
Posts: 547
Joined: 20 Apr 2014, 22:13

Re: Data analysis of the PL

Post by Aqua- »

@kk_white It was surprising to me as well. There is a lot of interesting data, I just need to find a better way to present it than writing it down like this. Tables are always more intuitive and impressive. Will work on it.

@Johnh you sit back and enjoy the stats. If you have any nice or interesting idea of something to look for, I'd love to hear it.

@SM I didn't see the post till now. This is very cool! Makes me think - the raw data I linked to features game by game scores with dates, so this kind of animation can show a week by week development of each season. This is quite nice! I will check out this Flourish software, which I had never heard of, although I have come across some of those animations before.
Regarding crouton, I haven't heard of it before, but it seems to be for advanced usage. I use python on ubuntu and it works perfectly. Anyway, cheers for using open source software.
Sovietmule
Paul Heckingbottom's career advisor
Posts: 155
Joined: 16 May 2019, 17:01

Re: Data analysis of the PL

Post by Sovietmule »

Aqua- wrote: @SM I didn't see the post till now. This is very cool! Makes me think - the raw data I linked to features game by game scores with dates, so this kind of animation can show a week by week development of each season. This is quite nice! I will check out this Flourish software, which I had never heard of, although I have come across some of those animations before.
Regarding crouton, I haven't heard of it before, but it seems to be for advanced usage. I use python on ubuntu and it works perfectly. Anyway, cheers for using open source software.
Flourish is great for animating data and so easy to use. There are plenty of tutorials on YT should you need them.

I had a few beers last night so it may be that I'm being slower than usual but ... I couldn't see any links to your data, have I missed it somewhere?

Cheers.
Aqua-
Simon Grayson's Hairdresser
Posts: 547
Joined: 20 Apr 2014, 22:13

Re: Data analysis of the PL

Post by Aqua- »

Yeah... there are no links to my data, just to the raw data which I downloaded.
The analyzed data will be uploaded here: https://drive.google.com/drive/folders/ ... sp=sharing .
For the next post I will present tables with the data as well.
Last edited by Aqua- on 06 Oct 2020, 19:08, edited 1 time in total.
Sovietmule
Paul Heckingbottom's career advisor
Posts: 155
Joined: 16 May 2019, 17:01

Re: Data analysis of the PL

Post by Sovietmule »

When I was trying to get to grips with python I think I was using Django. I couldn't get on with it and it seemed very command line based.
I might have another look at python though. Have you got any recommendations for someone who has a coding background looking to get started?
Cheers
Aqua-
Simon Grayson's Hairdresser
Posts: 547
Joined: 20 Apr 2014, 22:13

Re: Data analysis of the PL

Post by Aqua- »

More stats:

* In an ordinary season, the Home Win % of a club finishing in each position in the table is presented in the following table:
Image

The distribution of HW% against different position groups is presented in the following table:
Image



* The Home Draw % is presented in the following table:
Image

The distribution of HD% against different position groups is presented in the following table:
Image



* The Away Win % is presented in the following table:
Image

The distribution of AW% against different position groups is presented in the following table:
Image



* The Away Draw % is presented in the following table:
Image

The distribution of AD% against different position groups is presented in the following table:
Image


*edit - the images are links. Click on them to open the full size images.
**2nd edit - I just noticed that in all the colorful tables the values for the 20th position are wrong, since they were not converted into percentages, so do take note. I fixed it in my code and will re-upload later on.
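
For anyone who wants to reproduce a Home Win % table, here is a minimal sketch. It is not the code behind the images above and is not a diagnosis of the 20th-position issue; it only shows one way to do the percentage conversion in a single place so that every position goes through the same step. The input shapes (`(home, away, result)` tuples and a dict of final positions) and the function names `home_win_pct` and `pct_by_position` are assumptions made for the example.

```python
def home_win_pct(team, matches):
    """Percentage (0-100) of a team's home matches that it won.

    matches: iterable of (home, away, result), result being "H", "D" or "A".
    """
    home_results = [result for home, _away, result in matches if home == team]
    if not home_results:
        return 0.0
    return 100.0 * home_results.count("H") / len(home_results)


def pct_by_position(final_position, matches):
    """Map every final position to the home win % of the club that finished there."""
    matches = list(matches)  # allow generators to be reused for every team
    return {
        position: home_win_pct(team, matches)
        for team, position in sorted(final_position.items(), key=lambda kv: kv[1])
    }


# Made-up mini league of three clubs.
example_positions = {"A": 1, "B": 2, "C": 3}
example_matches = [
    ("A", "B", "H"),
    ("A", "C", "D"),
    ("B", "A", "A"),
    ("B", "C", "H"),
    ("C", "A", "H"),
]
print(pct_by_position(example_positions, example_matches))
```

Swapping `"H"` for `"D"` in `home_win_pct` gives a Home Draw % in the same way, and filtering on the away side instead gives the away versions.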