Parse HTML table with Python

Hey!

I get bowling results by email. I have used the Gmail API to access my mailbox, find the emails and extract the result link. Now I am trying to parse the HTML table behind that link into a readable, logical table. I don't even know whether I should use Pandas or just write to a CSV file. I am using BeautifulSoup to parse the HTML, but the table structure is so messy that I can't wrap my head around it. Here is a short example of the table code. Usually there are multiple tables on the page, and some tables hold results for multiple players, which makes it all that much harder.

 

<table class="scoresheet tenpin TenPin">
    <tr>
     <td>
      <table class="ss-data">
       <tr class="cls_frameheader">
        <td class="ss-name">
         <t>
          Players
         </t>
        </td>
        <td>
         1
        </td>
        <td>
         2
        </td>
        <td>
         3
        </td>
        <td>
         4
        </td>
        <td>
         5
        </td>
        <td>
         6
        </td>
        <td>
         7
        </td>
        <td>
         8
        </td>
        <td>
         9
        </td>
        <td class="cls_framehdr10">
         10
        </td>
        <td class="cls_frametotal">
         <t>
          Total
         </t>
        </td>
       </tr>
       <tr class="notranslate">
        <td class="cls_player">
         Anonymous
        </td>
        <td class="cls_frame">
         <table>
          <tr>
           <td>
            <table class="cls_tbl_balls">
             <tr class="cls_ball_row">
              <td class="cls_ball1">
               5
              </td>
              <td class="cls_ball2">
               /
              </td>
             </tr>
            </table>
           </td>
          </tr>
          <tr>
           <td class="cls_framescore">
            18
           </td>
          </tr>
         </table>
        </td>
        <td class="cls_frame">
         <table>
          <tr>
           <td>
            <table class="cls_tbl_balls">
             <tr class="cls_ball_row">
              <td class="cls_ball1">
               8
              </td>
              <td class="cls_ball2">
               1
              </td>
             </tr>
            </table>
           </td>
          </tr>
          <tr>
           <td class="cls_framescore">
            27
           </td>
          </tr>
         </table>
        </td>
        <td class="cls_frame">
         <table>
          <tr>
           <td>
            <table class="cls_tbl_balls">
             <tr class="cls_ball_row">
              <td class="cls_ball1">
               5
              </td>
              <td class="cls_ball2">
               -
              </td>
             </tr>
            </table>
           </td>
          </tr>
          <tr>
           <td class="cls_framescore">
            32
           </td>
          </tr>
         </table>
        </td>
        <td class="cls_frame">
         <table>
          <tr>
           <td>
            <table class="cls_tbl_balls">
             <tr class="cls_ball_row">
              <td class="cls_ball1">
               4
              </td>
              <td class="cls_ball2">
               /
              </td>
             </tr>
            </table>
           </td>
          </tr>
          <tr>
           <td class="cls_framescore">
            49
           </td>
          </tr>
         </table>
        </td>
        <td class="cls_frame">
         <table>
          <tr>
           <td>
            <table class="cls_tbl_balls">
             <tr class="cls_ball_row">
              <td class="cls_ball1">
               7
              </td>
              <td class="cls_ball2">
               1
              </td>
             </tr>
            </table>
           </td>
          </tr>
          <tr>
           <td class="cls_framescore">
            57
           </td>
          </tr>
         </table>
        </td>

That is the gist of the table. One full table is in the uploaded file. Could you give me some pointers on what to do, or ideas on which other modules to use?

test1.html


Use beautiful soup



I am trying with BeautifulSoup, but I am so overwhelmed that I don't know which functions I should use to get the best result out of this table. I have tried this:

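import pandas as pd

# table_soup is the BeautifulSoup object for the results page
records = []
for tr in table_soup.find_all("tr"):
    tds = tr.find_all("td")
    # text of the first cell in the row, flattened onto one line
    records.append(tds[0].text.replace('\n', ' '))

df = pd.DataFrame(data=records)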

And the result is:

    Players 1 2 3 4 5 6 7 8 9 10 Total   Anonymous1       5 /      18          8 1      27          5 -      32          4 /      49          7 1      57          9 /      73          6 /      90          7 /      110          X       130          7 / 7      147       147      Anonymous2       7 2      9          6 3      18          - 2      20          - -      20          1 2      23          9 /      33          - 9      42          4 3      49          5 /      69          X - 1      80       80      Anonymous3       7 2      9          - 4      13          7 -      20          9 -      29          3 -      32          4 -      36          6 /      55          9 -      64          9 -      73          - / 1      84       84      Anonymous4       - 9      9          3 -      12          - 8      20          4 5      29          8 -      37          9 /      57          X       76          9 -      85          9 /      105          X 6 -      121       121       Game 1

So I can kind of get the result, but for some reason it also prints this same line on several separate rows before it gets to the second game's results, which are again in one row. I am also lost as to what to do next: I can get the frame numbers and total as column headers, but parsing the ball results and totals into separate variables/strings seems way too hard.
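
The repeated lines most likely come from the nesting: a recursive search for `tr` also returns every `<tr>` inside the inner frame tables, and the first cell of the outermost row contains the whole scoresheet. Here is a minimal sketch of one way around that, assuming the real pages use the same class names as the sample above (`cls_player`, `cls_frame`, `cls_tbl_balls`, `cls_framescore`). The function name `parse_players` and the trimmed `SAMPLE` markup are made up for illustration.

```python
from bs4 import BeautifulSoup

# Trimmed-down markup in the same shape as the posted scoresheet
# (assumption: real pages follow the same class names).
SAMPLE = """
<table class="ss-data">
 <tr class="cls_frameheader">
  <td class="ss-name">Players</td><td>1</td><td>2</td>
 </tr>
 <tr class="notranslate">
  <td class="cls_player">Anonymous</td>
  <td class="cls_frame"><table>
   <tr><td><table class="cls_tbl_balls"><tr class="cls_ball_row">
    <td class="cls_ball1">5</td><td class="cls_ball2">/</td>
   </tr></table></td></tr>
   <tr><td class="cls_framescore">18</td></tr>
  </table></td>
  <td class="cls_frame"><table>
   <tr><td><table class="cls_tbl_balls"><tr class="cls_ball_row">
    <td class="cls_ball1">8</td><td class="cls_ball2">1</td>
   </tr></table></td></tr>
   <tr><td class="cls_framescore">27</td></tr>
  </table></td>
 </tr>
</table>
"""


def parse_players(html):
    """Return one dict per player: name plus a list of frames."""
    soup = BeautifulSoup(html, "html.parser")
    players = []
    # Start from the player-name cells instead of every <tr>, so the
    # rows of the nested frame tables are never treated as player rows.
    for name_cell in soup.select("td.cls_player"):
        player_row = name_cell.find_parent("tr")
        frames = []
        for frame in player_row.find_all("td", class_="cls_frame"):
            # Ball cells (two, or up to three in the 10th frame).
            balls = [td.get_text(strip=True)
                     for td in frame.select("table.cls_tbl_balls td")]
            score_cell = frame.find("td", class_="cls_framescore")
            text = score_cell.get_text(strip=True) if score_cell else ""
            # Running score; None when the cell is missing or empty.
            score = int(text) if text.isdigit() else None
            frames.append({"balls": balls, "score": score})
        players.append({"player": name_cell.get_text(strip=True),
                        "frames": frames})
    return players


players = parse_players(SAMPLE)
print(players)
```

Each player ends up with the balls and the running score of every frame as separate values, which can then be turned into DataFrame columns or CSV fields.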


How about pandas.read_html? :) But yes, structure seems kind of crazy...


  • 2 weeks later...

I haven't found time to work on my script for a while, but now I have, so I figured I'd post an update.

pandas.read_html really did work! I don't know how I missed that function.

pd.read_html(link, header=0, index_col=0, attrs={'class': 'ss-data'})

That one line did it. It gets all the player names, frames and results. The script is almost done; I am now working on exporting it neatly to Excel or CSV so I can compute some statistics.
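
For the export step, here is a rough sketch of one possible approach. `pd.read_html` returns a list of DataFrames (one per matching table), so the games can be stacked into a single frame and written out with `to_csv`. The helper `combine_games` is a made-up name, and the two small DataFrames below are placeholder stand-ins for the real `read_html` output, whose exact cell contents will differ.

```python
import io

import pandas as pd


def combine_games(tables):
    """Stack per-game DataFrames, tagging each row with its game number."""
    tagged = []
    for game_no, table in enumerate(tables, start=1):
        game = table.copy()
        game.insert(0, "Game", game_no)  # new first column
        tagged.append(game)
    return pd.concat(tagged)


# Placeholder stand-ins for the list returned by pd.read_html(...);
# the values are illustrative only.
names = pd.Index(["Anonymous1", "Anonymous2"], name="Players")
game1 = pd.DataFrame({"1": ["5 /", "7 2"], "Total": [147, 80]}, index=names)
game2 = pd.DataFrame({"1": ["X", "9 -"], "Total": [120, 95]}, index=names)

results = combine_games([game1, game2])

# Written to an in-memory buffer here; pass a file path such as
# "results.csv" to to_csv to write to disk instead.
buffer = io.StringIO()
results.to_csv(buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

`DataFrame.to_excel` takes a path in the same way if a spreadsheet is preferred, though it needs an Excel writer package such as openpyxl installed.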
