
Python - Need to simplify some code

EIijah1
Solved by susPenguin


I am turning a text file with a lot of useless information into a CSV file containing only the information I need from each block of text.

So far I'm able to pull the information out of the text file with some consistency; however, I feel like it's too much code for what I'm doing.

 

import fnmatch

# read the whole file and split it into lines
my_file = open("UserData.txt", "r")
content = my_file.read()
print(content)
content_list = content.split('\n')
my_file.close()

# strip leading/trailing dashes, then drop empty lines and duplicates
converted_list = []
for element in content_list:
    converted_list.append(element.strip('-'))
converted_list = list(filter(None, converted_list))
converted_list = list(dict.fromkeys(converted_list))
TheRealList = []

# collect the line(s) matching each field I care about
pattern = ['*Payroll Number*', '*First Name*', '*Preferred*', '*Surname*', '*Program*', '*Position Title*', '*Employee Type*', '*Manager*']
for item in pattern:
    matching = fnmatch.filter(converted_list, item)
    print(matching)
    test3 = ' '.join(matching)
    TheRealList.append(test3)


# pull the employee ID out of the matched "Payroll Number" line
EmployeeIDstring = TheRealList[0]
EmployeeIDList = EmployeeIDstring.split(" ")
EmployeeIDindex = EmployeeIDList.index("Number")
EmployeeID = EmployeeIDList[EmployeeIDindex + 1]

print(EmployeeID)

My main issue is the block that handles "EmployeeID". It functions well, but it has to be repeated for each piece of information I want.
My other issue is that I only know certain bits of information; for example, I know that the word "Number" comes right before the employee ID, which is how I'm pulling that value out of my list. I also have to know at which index the employee ID will appear, but that should be consistent.
I'm not looking for any handouts, but I've been stuck trying to turn this code into a loop for a day or so now, and googling hasn't gotten me far.

Any ideas or suggestions would be great 🙂

 


If you can use imports in your script, check out pandas.
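
For instance, a rough sketch of the output side (my own example, not tested against your file; it assumes you've already parsed each employee into a dict, and the output filename is made up) might be:

import pandas as pd

# assume each employee has already been parsed into a dict of field -> value
records = [
    {"EmployeeID": "123456", "First Name": "Joe", "Surname": "Bloggs"},
]

df = pd.DataFrame(records)            # one row per employee
df.to_csv("output.csv", index=False)  # header row comes from the dict keys

pandas also makes it easy to filter or reorder columns before writing.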


If you can provide a sample input and expected output, that would be helpful.


3 hours ago, elpiop said:

If you can provide a sample input and expected output, that would be helpful.

Oh yes my bad, here's an example input:
EmployeeID: 123456
Issue at 18/05/2022
Staff Member Name Joe Bloggs
First Name Joe
Preferred Name Jo
Middle Name Melanie
Surname Bloggs
Start Date 27/06/2022
Contract End Date
Country New Zealand

Output would be
123456
Joe
Bloggs
27/06/2022
New Zealand


If that is how your data is stored, I would create an array of all the fields you want to retrieve data for (the field being the beginning of the line, before the value):

fields = [
    "EmployeeID",
    "Issue at",
    "Staff Member Name",
    "First Name",
    "Preferred Name",
    "Middle Name",
    "Surname"
    "Start Date",
    "Contract End Date",
    "Country"
]

 

Then I would read the contents of the file, loop over all the fields, and replace each field to inject a separator character.

my_file = open("UserData.txt", "r")
file_contents = my_file.read()
my_file.close()

# loop over all the fields, adding a character (in this case a ':' colon) to separate each field from its value
for field in fields:
    file_contents = file_contents.replace(field, f"{field}:")

 

Now that we have a separator, we can simply loop over the lines in the file and split each line on that separator to get a field/value pair. The pair can then be used to populate a dict for storing the data.

 

# using the fields array, create a dict with one key per field
data = dict.fromkeys(fields)

# loop over the lines in the file contents
for line in file_contents.split('\n'):
    # split the line on the field/value separator; maxsplit=1 handles lines,
    # like "EmployeeID: 123456", that already contained a ':' of their own
    parts = line.split(':', 1)
    # check to make sure there are two parts, as not all fields may have a value
    if len(parts) == 2:
        (field, value) = parts
        # here we use the field as the key to assign the value in the dict
        data[field] = value.strip(' :')  # strip removes whitespace and any leftover ':'

# our data should now be stored neatly in a dict
print(data)

 

Now that our data is in a dict, we can simply use the field name to retrieve the value.

print(f"EmployeeID {data['EmployeeID']} name is {data['First Name']} {data['Surname']} they started on {data['Start Date']}")

# Example output would be
# EmployeeID 123456 name is Joe Bloggs they started on 27/06/2022
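
And if the end goal is still a CSV, a minimal follow-on sketch (my own addition; it reuses the fields and data names from above and assumes "clean.csv" as the output name) using the standard csv module could be:

import csv

# write the parsed record under a header row built from the field names
with open("clean.csv", "w", newline="") as csv_file:
    writer = csv.DictWriter(csv_file, fieldnames=fields)
    writer.writeheader()
    writer.writerow(data)  # fields without a value simply come out as empty cells

csv.DictWriter handles quoting and column order, so values containing commas won't break the file.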

  • 2 weeks later...

In PHP I'd do it like this:

 

<?php

$content = file_get_contents('filename.txt');

// Windows uses \r\n for line endings, Linux only \n, some older Macs only \r
// make sure our line separator is "\n"
$content = str_replace("\r", "\n", $content);
$content_lines = explode("\n", $content);

$all = [];  // holds all the info, an array of $person
$person = []; // holds info of a single person
$fields = array (
    "id" => "EmployeeID",
    "issue" => "Issue at",
    "staff" => "Staff Member Name",
    "first" => "First Name",
    "preferred" => "Preferred Name",
    "middle" => "Middle Name",
    "surname" => "Surname"
    "date_start" => "Start Date",
    "date_finish" => "Contract End Date",
    "country" => "Country",
);
// go line by line
foreach ($content_lines as $line_id => $line) {
	foreach ($fields as $field_id => $field_string) {
		// case insensitive search in the text line for the text ex "issue at" , "surname"
		$pos = stripos($line, $field_string);
		if ($pos!==FALSE) {
			$value = substr($line,$pos+strlen($field_string)); // from where field text ends to the end of line
			// remove space, ':' and "\r" (if the text file came from Windows it used "\r\n" for new lines, and we split lines only on "\n")
			$value = trim($value," :\r");	
			if ($field_id=='id') {
				// assume ID is always the first field, meaning a new person starts here, so push the previous person and start fresh
				if (count($person) > 0) { array_push($all,$person); $person = []; }
			}
			$person[$field_id] = $value;
		}
	}
}
if (count($person)>0) array_push($all,$person);

?>

 


Hi, I appreciate the thread and I hope this assists in your quest.


The format of "raw.txt" is reflective of the format provided on:

I then generated 100 unique records to provide some scalability and to check for accidental errors with irregular data. Please note: if "raw.txt" begins with a "\n" (newline), the Python script will fail.
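
If that's a concern, one small guard (my own addition, not something in the attached main.py) is to drop blank lines up front before the main loop runs:

# read raw.txt and discard blank lines so a leading "\n" can't trip the loop
f = open("raw.txt", "r")
lines = [line for line in f if line.strip() != ""]
f.close()

The main loop below could then iterate over lines instead of the file object directly.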
 

The "clean.csv" is reflective of the requested cleaned CSV file. You may edit the "replacement_terms" of the correct pretext strings (i.e. replace "EmployeeID" with "Payroll Number") in the script to be reflective of your true headers.

 

replacement_terms = {"\n": "", "EmployeeID: ": "\n", "Issue at ": "", "Staff Member Name ": "", "First Name ": "",
                     "Preferred Name ": "", "Middle Name ": "", "Surname ": "", "Start Date ": "", "Contract End ": "",
                     "Country ": ""}

 

 

main.py is the actual script for making these edits. Your primary focus should be on this loop:

 

f = open("raw.txt", "r")
employees_list = []   # list of complete employee records
employee_list = []    # fields for the employee currently being read
for line in f:
    # strip the newline and the field prefixes; "EmployeeID: " becomes "\n"
    for key, value in replacement_terms.items():
        line = line.replace(key, value)
    # an injected "\n" marks the start of a new employee record
    if "\n" in line and line != "\n":
        line = line.replace("\n", "")
        if len(employee_list) > 0:
            employees_list += [employee_list.copy()]
        employee_list.clear()
    employee_list.append(line)
if len(employee_list) > 0:
    employees_list += [employee_list.copy()]  # don't drop the last record
f.close()


This is the actual code that does the edits (with the above-mentioned replacement_terms on line 2 of main.py). The script should be fairly self-explanatory. I'm sure it's far from the most optimized, but it works decently well and doesn't use any imports.

 

Another thing to note: the script will overwrite any existing clean.csv. If you'd rather append information as you go (instead of overwriting), you can remove these lines:

 

# Let's write the CSV header
f = open("clean.csv", "w")
for key in replacement_terms.keys():
    if key != "\n":
        f.write(key + ",")
f.write("\n")
f.close()

The "w" means "write" which will forcefully write over the existing .csv; removing this will leave only the appending "a" of the CSV body portion.

I threw in another script, "generate_test_data.py", to generate the example raw.txt data. This does include an import, as it falls outside the original request; however, it's useful to show how this information was generated.
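
That script isn't quoted here, but a minimal generator in the same spirit (my own sketch with made-up name and country pools; the attached generate_test_data.py may differ) could look like:

import random

first_names = ["Joe", "Jane", "Ari", "Mei"]        # made-up sample pools
surnames = ["Bloggs", "Smith", "Ngata", "Khan"]
countries = ["New Zealand", "Australia", "Canada"]

with open("raw.txt", "w") as f:
    for i in range(100):
        first = random.choice(first_names)
        last = random.choice(surnames)
        f.write(f"EmployeeID: {100000 + i}\n")
        f.write(f"Issue at {random.randint(1, 28):02d}/05/2022\n")
        f.write(f"Staff Member Name {first} {last}\n")
        f.write(f"First Name {first}\n")
        f.write(f"Preferred Name {first}\n")
        f.write("Middle Name Melanie\n")
        f.write(f"Surname {last}\n")
        f.write(f"Start Date {random.randint(1, 28):02d}/06/2022\n")
        f.write("Contract End Date\n")
        f.write(f"Country {random.choice(countries)}\n")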

Attachments: clean.csv, main.py, raw.txt, generate_test_data.py


On 6/5/2022 at 6:28 PM, susPenguin said:


Wow, that's a lot of effort, thank you! You've definitely simplified what I had a lot. 🙂

