Sed command not working

babadoctor

I want to extract certain portions of a text file whenever a pattern matches: everything between https://www.twitch.tv/videos/ and the next ". The text between those two strings is always a number with a fixed length of 9 digits. (I don't think it will reach 10 digits, at least not in the next 10 years.) I would then store that number in a variable and use it in a separate command.

I tried running this:

 

cat output | sed -e 's^https://www.twitch.tv/videos/\(.*\)"^\1^' 

But it didn't work! The command may look weird to you, but I needed to use a different delimiter instead of the default /, due to the actual pattern having /'s in it. 

 

All it does is print out everything in the file, without running it through sed. 

 

I got the command from this stackoverflow post:

https://stackoverflow.com/questions/13242469/how-to-use-sed-grep-to-extract-text-between-two-words (the answer)

As I said, I replaced the default delimiter with ^. That should be fine, since sed accepts almost any character as the delimiter, but it still didn't work.

 

Any ideas?

 

OFF TOPIC: I suggest every poll from now on to have "**CK EA" option instead of "Other"


Use the -E flag to use extended regexp syntax, and get rid of the backslashes around your parentheses.  Then use the special \1 backreference as the replacement string to print out the text matched by the capturing parentheses.  (\1 through \9 refer to matched sub-strings: each one stands for whatever the corresponding pair of capturing parentheses found.)

 

sed -E 's_https://www.twitch.tv/videos/(.*)"_\1_'
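
For anyone less used to backreferences, here is a rough Python sketch of what that substitution does. It is only an illustration: re.sub stands in for sed's s command, and the sample lines are made up rather than taken from the original file. Note that only the matched span gets replaced, and a line without a match comes back unchanged.

```python
import re

# Same idea as the sed expression: capture everything after the URL prefix
# up to a double quote, then replace the whole match with the captured group.
PATTERN = re.compile(r'https://www\.twitch\.tv/videos/(.*)"')

def substitute(line):
    # \1 refers to the first capturing group, as it does in sed.
    return PATTERN.sub(r'\1', line)

# Hypothetical sample lines, not taken from the original file.
bare = 'https://www.twitch.tv/videos/123456789"'
embedded = '"url": "https://www.twitch.tv/videos/123456789",'
unrelated = '"title": "some stream",'

print(substitute(bare))       # only the captured digits remain
print(substitute(embedded))   # text around the match is kept as-is
print(substitute(unrelated))  # no match, so the line is unchanged
```

So a substitution on its own never hides the rest of the input; it only rewrites the part that matched.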


1 hour ago, Azgoth 2 said:

Use the -E flag to use extended regexp syntax, and get rid of the backslashes around your parentheses.  Then use the special \1 backreference as the replacement string to print out the text matched by the capturing parentheses.  (\1 through \9 refer to matched sub-strings: each one stands for whatever the corresponding pair of capturing parentheses found.)

 

sed -E 's_https://www.twitch.tv/videos/(.*)"_\1_'

It still doesn't seem to work... Am I piping the text into the command incorrectly? :(

This is the command:

cat output | sed -E 's_https://www.twitch.tv/videos/(.*)"_\1_'

This is the output:

https://hastebin.com/izezororem.scala

Original file:

https://hastebin.com/cuvudamisu.rb

 


Ah, I see what's happening.  I was testing this on a single random twitch.tv video URL.  In your text it's replacing each URL with just what comes after the /videos/ part and leaving the rest of every line untouched.  Sed is really meant for transforming text; for just pulling out matching substrings, you'll want (g)awk.

 

gawk 'match($0, /https:\/\/www\.twitch\.tv\/videos\/([^\"]*)\",/, arr) {print arr[1]}' output

Regular expressions in (g)awk are surrounded by / characters, so there's a lot of ugly escaping.  Quotation marks are also escaped because they normally represent literal string delimiters.
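
To show the same extract-only behaviour in another form, here is a minimal Python sketch that mirrors the gawk one-liner: it looks at each line, and prints the captured part only when the line matches. The function name and the sample lines are assumptions made up for illustration. Like the gawk version, it reports just the first match on each line.

```python
import re

# Mirrors the gawk regex: URL prefix, then anything that is not a double
# quote, then the closing quote followed by a comma.
PATTERN = re.compile(r'https://www\.twitch\.tv/videos/([^"]*)",')

def first_match_per_line(lines):
    """Yield the captured text from each matching line (first match only)."""
    for line in lines:
        m = PATTERN.search(line)
        if m:
            yield m.group(1)

# Hypothetical sample input, not taken from the original file.
sample = [
    '"title": "some stream",',
    '"url": "https://www.twitch.tv/videos/123456789",',
    '"url": "https://www.twitch.tv/videos/987654321", "other": "x",',
]

for video_id in first_match_per_line(sample):
    print(video_id)
```

Lines that do not contain the URL produce no output at all, which is the difference from the sed substitution.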


48 minutes ago, Azgoth 2 said:

Ah, I see what's happening.  I was testing this on a single random twitch.tv video URL.  In your text it's replacing each URL with just what comes after the /videos/ part and leaving the rest of every line untouched.  Sed is really meant for transforming text; for just pulling out matching substrings, you'll want (g)awk.

 


gawk 'match($0, /https:\/\/www\.twitch\.tv\/videos\/([^\"]*)\",/, arr) {print arr[1]}' output

Regular expressions in (g)awk are surrounded by / characters, so there's a lot of ugly escaping.  Quotation marks are also escaped because they normally represent literal string delimiters.

It works! Great! Now all I need to figure out is how to make this detect all patterns in the file, and not stop after only detecting one pattern...

Do you have any ideas? (I am thinking of writing each argument to a variable, possibly with xargs -n1?)


As long as you don't have multiple matches per line to worry about, that should work: it prints each match on a new line in stdout, based on my quick tests.  Admittedly I don't use awk/gawk much, so I wasn't aware it had issues with multiple matches per line until I looked it up just now.  Frankly I'm starting to lean towards just writing a script in Python, or whatever language you're comfortable with, to do the matching for you:

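#!/usr/bin/python3
import re

# Capture everything between the URL prefix and the next '",' (non-greedy).
pattern = re.compile(r'https://www\.twitch\.tv/videos/(.*?)",')

with open("/path/to/file", "r") as f:
    text = f.read()
for i in pattern.findall(text):
    print(i)

# or for multiple files in a single directory
import os
import re

pattern = re.compile(r'https://www\.twitch\.tv/videos/(.*?)",')
# Skip subdirectories; open() would fail on them.
files = [i.path for i in os.scandir("/path/to/files") if i.is_file()]
for F in files:
    with open(F, "r") as f:
        text = f.read()
    for i in pattern.findall(text):
        print(i)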

That has no issue finding matches on multiple lines or multiple matches within a single line, again based on some of my quick and hacky testing.
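
Since the original question says the IDs are plain digits and that each one should end up in a variable for a separate command, a slightly tighter variant might look like the sketch below. It is only a sketch under assumptions: the \d+ pattern relies on the IDs being digits only, the sample text is made up, and "some-command" is a placeholder for whatever command is meant to be run, which the thread does not name.

```python
import re

# Digits only, followed by the closing quote; assumes IDs are purely numeric.
ID_PATTERN = re.compile(r'https://www\.twitch\.tv/videos/(\d+)"')

def find_video_ids(text):
    """Return every video ID in the text, including several on one line."""
    return ID_PATTERN.findall(text)

def build_command(video_id):
    # "some-command" is a placeholder, not a real tool from the thread.
    return ["some-command", "https://www.twitch.tv/videos/" + video_id]

# Hypothetical sample with two matches on the first line and one on the second.
sample_text = (
    '{"a": "https://www.twitch.tv/videos/111111111", '
    '"b": "https://www.twitch.tv/videos/222222222",\n'
    '"c": "https://www.twitch.tv/videos/333333333"}'
)

ids = find_video_ids(sample_text)
for video_id in ids:
    # Each ID is available as a variable here; the command is built, not run.
    print(build_command(video_id))
```

The argument list returned by build_command could be handed to subprocess.run once the placeholder is replaced with a real command.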
