Data cleaning

Depending on your needs, you may want to add a few columns to your plays dataframe or tidy the ones already there. Doing so makes the data easier to navigate in the long run.

A simple clean-up can be done with pyjanitor, whose remove_empty function drops any rows and columns that are entirely empty (if any exist).

Code
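import pandas as pd
from datavolley import read_dv
from janitor import remove_empty

# Show every column and widen the console output so rows wrap less
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 300)

# None loads the example file; pass a path to read your own .dvw
dv_instance = read_dv.DataVolley(None)
df = dv_instance.get_plays()

# Drop rows and columns that contain no data at all
df = remove_empty(df)

# Keep only the rows that record a skill
print(df[df['skill'].notna()])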
                                  match_id video_file_number video_time                code                      team player_number      player_name player_id      skill evaluation_code setter_position attack_code set_code set_type start_zone end_zone end_subzone num_players_numeric  \
4     017366f2-d6b1-4bed-ab7f-0d1dcc7b4097                 1        494   *19SM+~~~78A~~~00  University of Louisville            19  Shannon Shields   -296094      Serve               +               1         NaN      NaN      NaN          7        8           A                 NaN   
5     017366f2-d6b1-4bed-ab7f-0d1dcc7b4097                 1        495  a02RM-~~~58AM~~00B      University of Dayton             2    Maura Collins   -230138  Reception               -               6         NaN      NaN      NaN          5        8           A                 NaN   
6     017366f2-d6b1-4bed-ab7f-0d1dcc7b4097                 1        497   a08ET#~~~~8C~~~00      University of Dayton             8  Brooke Westbeld   -232525        Set               #               6         NaN      NaN        ~        NaN        8           C                 NaN   
7     017366f2-d6b1-4bed-ab7f-0d1dcc7b4097                 1        499  a10AT-X5~46CH2~00F      University of Dayton            10   Jamie Peterson    -11802     Attack               -               6          X5      NaN      NaN          4        6           C                   2   
8     017366f2-d6b1-4bed-ab7f-0d1dcc7b4097                 1        499   *11BT+~~~~2C~~~00  University of Louisville            11   Anna Stevenson   -278838      Block               +               1         NaN      NaN      NaN        NaN        2           C                 NaN   
...                                    ...               ...        ...                 ...                       ...           ...              ...       ...        ...             ...             ...         ...      ...      ...        ...      ...         ...                 ...   
1476  017366f2-d6b1-4bed-ab7f-0d1dcc7b4097                 1       5217   *08EH#~~~~9D~~~+6  University of Louisville             8    Lexi Hamilton    -75970        Set               #               4         NaN      NaN        ~        NaN        9           D                 NaN   
1477  017366f2-d6b1-4bed-ab7f-0d1dcc7b4097                 1       5218  *10AH=V5~44BH2~+6F  University of Louisville            10      Mel McHenry    -75967     Attack               =               4          V5      NaN      NaN          4        4           B                   2   
1478  017366f2-d6b1-4bed-ab7f-0d1dcc7b4097                 1       5219             ap24:19      University of Dayton           NaN              NaN       NaN      Point             NaN               3         NaN      NaN      NaN        NaN      NaN         NaN                 NaN   
1484  017366f2-d6b1-4bed-ab7f-0d1dcc7b4097                 1       5252   a18SM=~~~71B~~~-5      University of Dayton            18      Grace Dynda   -282421      Serve               =               2         NaN      NaN      NaN          7        1           B                 NaN   
1485  017366f2-d6b1-4bed-ab7f-0d1dcc7b4097                 1       5253             *p25:19  University of Louisville           NaN              NaN       NaN      Point             NaN               4         NaN      NaN      NaN        NaN      NaN         NaN                 NaN   

      home_team_score  visiting_team_score home_setter_position visiting_setter_position custom_code home_p1 home_p2 home_p3 home_p4 home_p5 home_p6 visiting_p1 visiting_p2 visiting_p3 visiting_p4 visiting_p5 visiting_p6  start_coordinate  mid_coordinate  end_coordinate point_phase   attack_phase  \
4                   1                    0                    1                        6          00      19       9      11      15      10       7           1          16          17          10           6           8               431            <NA>            7642       Serve            nan   
5                   1                    0                    1                        6         00B      19       9      11      15      10       7           1          16          17          10           6           8               431            <NA>            7642   Reception            nan   
6                   1                    0                    1                        6          00      19       9      11      15      10       7           1          16          17          10           6           8              3147            <NA>            <NA>   Reception            nan   
7                   1                    0                    1                        6         00F      19       9      11      15      10       7           1          16          17          10           6           8              4512            5522            8150   Reception      Reception   
8                   1                    0                    1                        6          00      19       9      11      15      10       7           1          16          17          10           6           8              4578            <NA>            <NA>       Serve            nan   
...               ...                  ...                  ...                      ...         ...     ...     ...     ...     ...     ...     ...         ...         ...         ...         ...         ...         ...               ...             ...             ...         ...            ...   
1476               24                   19                    4                        3          +6      17      10       7      19       9      11          10          15           8           1          16           3              2469            <NA>            <NA>       Serve            nan   
1477               24                   19                    4                        3         +6F      17      10       7      19       9      11          10          15           8           1          16           3              4211            <NA>            5270       Serve  BP-Transition   
1478               24                   19                    4                        3        None      17      10       7      19       9      11          10          15           8           1          16           3              4830            <NA>            <NA>   Reception            nan   
1484               25                   19                    4                        2          -5      17      10       7      19       9      11          18           8           1          16           3          10               337            <NA>            8812       Serve            nan   
1485               25                   19                    4                        2        None      17      10       7      19       9      11          18           8           1          16           3          10              1288            <NA>            <NA>   Reception            nan   

     start_coordinate_x start_coordinate_y mid_coordinate_x mid_coordinate_y end_coordinate_x end_coordinate_y set_number                 home_team         visiting_team home_team_id visiting_team_id              point_won_by              serving_team            receiving_team  rally_number  \
4               1.26875           0.092596             <NA>             <NA>          1.68125         5.425924          1  University of Louisville  University of Dayton           17               42  University of Louisville  University of Louisville      University of Dayton             1   
5               1.26875           0.092596             <NA>             <NA>          1.68125         5.425924          1  University of Louisville  University of Dayton           17               42  University of Louisville  University of Louisville      University of Dayton             1   
6               1.86875           2.092594             <NA>             <NA>             <NA>             <NA>          1  University of Louisville  University of Dayton           17               42  University of Louisville  University of Louisville      University of Dayton             1   
7               0.55625            3.12963          0.93125          3.87037          1.98125         5.796294          1  University of Louisville  University of Dayton           17               42  University of Louisville  University of Louisville      University of Dayton             1   
8               3.03125            3.12963             <NA>             <NA>             <NA>             <NA>          1  University of Louisville  University of Dayton           17               42  University of Louisville  University of Louisville      University of Dayton             1   
...                 ...                ...              ...              ...              ...              ...        ...                       ...                   ...          ...              ...                       ...                       ...                       ...           ...   
1476            2.69375           1.574076             <NA>             <NA>             <NA>             <NA>          3  University of Louisville  University of Dayton           17               42      University of Dayton  University of Louisville      University of Dayton            43   
1477            0.51875           2.907408             <NA>             <NA>          2.73125         3.648148          3  University of Louisville  University of Dayton           17               42      University of Dayton  University of Louisville      University of Dayton            43   
1478            1.23125           3.351852             <NA>             <NA>             <NA>             <NA>          3  University of Louisville  University of Dayton           17               42      University of Dayton  University of Louisville      University of Dayton            43   
1484            1.49375           0.018522             <NA>             <NA>          0.55625         6.314812          3  University of Louisville  University of Dayton           17               42  University of Louisville      University of Dayton  University of Louisville            44   
1485            3.40625           0.685188             <NA>             <NA>             <NA>             <NA>          3  University of Louisville  University of Dayton           17               42  University of Louisville      University of Dayton  University of Louisville            44   

      possesion_number  
4                    0  
5                    1  
6                    1  
7                    1  
8                    2  
...                ...  
1476                 2  
1477                 2  
1478                 3  
1484                 0  
1485                 1  

[928 rows x 56 columns]

For example, you might want to replace the match_id with the filename of the .dvw file.

Code
from pathlib import Path

dv_instance = read_dv.DataVolley(None)
# Path.stem drops the folders and the file extension on any operating system
new_match_id = Path(dv_instance.file_path).stem
df['match_id'] = new_match_id
print(df[df['skill'].notna()].head())
       match_id video_file_number video_time                code                      team player_number      player_name player_id      skill evaluation_code setter_position attack_code set_code set_type start_zone end_zone end_subzone num_players_numeric  home_team_score  visiting_team_score  \
4  example_data                 1        494   *19SM+~~~78A~~~00  University of Louisville            19  Shannon Shields   -296094      Serve               +               1         NaN      NaN      NaN          7        8           A                 NaN                1                    0   
5  example_data                 1        495  a02RM-~~~58AM~~00B      University of Dayton             2    Maura Collins   -230138  Reception               -               6         NaN      NaN      NaN          5        8           A                 NaN                1                    0   
6  example_data                 1        497   a08ET#~~~~8C~~~00      University of Dayton             8  Brooke Westbeld   -232525        Set               #               6         NaN      NaN        ~        NaN        8           C                 NaN                1                    0   
7  example_data                 1        499  a10AT-X5~46CH2~00F      University of Dayton            10   Jamie Peterson    -11802     Attack               -               6          X5      NaN      NaN          4        6           C                   2                1                    0   
8  example_data                 1        499   *11BT+~~~~2C~~~00  University of Louisville            11   Anna Stevenson   -278838      Block               +               1         NaN      NaN      NaN        NaN        2           C                 NaN                1                    0   

  home_setter_position visiting_setter_position custom_code home_p1 home_p2 home_p3 home_p4 home_p5 home_p6 visiting_p1 visiting_p2 visiting_p3 visiting_p4 visiting_p5 visiting_p6  start_coordinate  mid_coordinate  end_coordinate point_phase attack_phase start_coordinate_x start_coordinate_y  \
4                    1                        6          00      19       9      11      15      10       7           1          16          17          10           6           8               431            <NA>            7642       Serve          nan            1.26875           0.092596   
5                    1                        6         00B      19       9      11      15      10       7           1          16          17          10           6           8               431            <NA>            7642   Reception          nan            1.26875           0.092596   
6                    1                        6          00      19       9      11      15      10       7           1          16          17          10           6           8              3147            <NA>            <NA>   Reception          nan            1.86875           2.092594   
7                    1                        6         00F      19       9      11      15      10       7           1          16          17          10           6           8              4512            5522            8150   Reception    Reception            0.55625            3.12963   
8                    1                        6          00      19       9      11      15      10       7           1          16          17          10           6           8              4578            <NA>            <NA>       Serve          nan            3.03125            3.12963   

  mid_coordinate_x mid_coordinate_y end_coordinate_x end_coordinate_y set_number                 home_team         visiting_team home_team_id visiting_team_id              point_won_by              serving_team        receiving_team  rally_number  possesion_number  
4             <NA>             <NA>          1.68125         5.425924          1  University of Louisville  University of Dayton           17               42  University of Louisville  University of Louisville  University of Dayton             1                 0  
5             <NA>             <NA>          1.68125         5.425924          1  University of Louisville  University of Dayton           17               42  University of Louisville  University of Louisville  University of Dayton             1                 1  
6             <NA>             <NA>             <NA>             <NA>          1  University of Louisville  University of Dayton           17               42  University of Louisville  University of Louisville  University of Dayton             1                 1  
7          0.93125          3.87037          1.98125         5.796294          1  University of Louisville  University of Dayton           17               42  University of Louisville  University of Louisville  University of Dayton             1                 1  
8             <NA>             <NA>             <NA>             <NA>          1  University of Louisville  University of Dayton           17               42  University of Louisville  University of Louisville  University of Dayton             1                 2  

Any cleaning you do at this stage tends to pay off in the long term. Perhaps the file sits in nested folders that name the league, the conference, or the week of the season, or perhaps the filename contains the match date. Parsing that extra information into your dataset gives you more to work with later.
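
As a minimal sketch of that idea, the helper below pulls metadata out of a file path using only the standard library. The folder layout (league/conference/week_NN/), the date prefix in the filename, the example path, and the name parse_match_path are all assumptions made for illustration. None of them come from datavolley, so adjust them to match your own archive.

```python
from datetime import datetime
from pathlib import PureWindowsPath


def parse_match_path(file_path: str) -> dict:
    """Extract metadata from a path such as
    .../<league>/<conference>/week_03/2024-01-15_home-away.dvw

    Hypothetical layout: adapt the indexes and formats to your own folders.
    """
    # PureWindowsPath accepts both '\\' and '/' as separators, so the same
    # code handles paths written on either operating system.
    path = PureWindowsPath(file_path)

    # The three folders directly above the file (assumed order).
    league, conference, week_folder = path.parts[-4:-1]
    stem = path.stem  # filename without the .dvw extension

    # Assumed filename convention: ISO date, underscore, free text.
    date_text = stem.split("_", 1)[0]
    try:
        match_date = datetime.strptime(date_text, "%Y-%m-%d").date()
    except ValueError:
        match_date = None  # the filename does not start with a date

    return {
        "match_id": stem,
        "league": league,
        "conference": conference,
        "week": int(week_folder.split("_")[-1]),
        "match_date": match_date,
    }


# Hypothetical path, used only to demonstrate the helper.
meta = parse_match_path(r"C:\matches\ncaa\acc\week_03\2024-01-15_home-away.dvw")
print(meta)
```

Each value can then be written to the dataframe as a constant column in the same way as match_id, for example df['league'] = meta['league']. The sketch expects at least three folders above the file and will raise an error on shallower paths.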