
A little backstory first: I'm using a script to translate tags on a Japanese version of DeviantArt. It has an option to batch-add tag translations, but setting up the file can take some time and effort.

So I want to speed this process up by using the character lists found on Wikipedia, which list the names in both English and Japanese. The method I want to use is to copy the HTML into Notepad++ and isolate the names using a regular expression. I've gotten this (clicky) far, but Notepad++ doesn't seem to let me copy the marked code, and I have tried everything. Would anyone have an idea of either A) what I mean, B) how to do this, or C) a better way to achieve what I want, which is a copyable list of the names (see ex 1)?

ex 1
eng | JP
eng | JP
eng | JP

Thanks, and feel free to ask for a rephrase, as it's 7am and I've been up all night on this.
https://linustechtips.com/topic/18652-need-a-little-help-with-notepad/

I don't really know Notepad++, but seeing as regular expressions are more or less universal (with some things specific to each implementation), I might be able to help out anyway.

To make sure I understand this correctly: the regex operation you want to do is to substitute something like this:

<span class="t_nihongo_kanji" lang="ja" xml:lang="ja">[...Japanese characters...]</span><span class="t_nihongo_comma" style="display:none">,</span> <i>[...English translation...]</i>
with
[...Japanese characters...] | [...English translation...]
?

So I would suggest running a search and replace on your text, specifically:

Find:

[...your regular expression from screenshot...]
Replace:
\2 | \1
Or something similar.

EDIT: Sorry, had to drive my neighbour to the emergency room; he's recently had knee surgery and it's acting up. Anyway, one more thing I wanted to add: I'm not sure which characters need to be escaped in Notepad++'s regex engine (the |, for example). If you're getting errors or unexpected results, that is definitely something I would look into as well.
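To illustrate the escaping question, here is a minimal sketch using Python's `re` module rather than Notepad++, so details may differ between the two engines. The sample string is made up. It shows that an unescaped `|` in the search pattern means "or", while in the replacement text it is just a literal character.

```python
import re

sample = "left|right"  # made-up sample text, not from the Wikipedia page

# Unescaped, | is alternation: this matches "left" OR "right".
alternatives = re.findall(r"left|right", sample)
print(alternatives)

# Escaped, \| matches a literal pipe character.
literal = re.findall(r"left\|right", sample)
print(literal)

# In the replacement text, | has no special meaning and is copied as-is.
swapped = re.sub(r"(\w+)\|(\w+)", r"\2 | \1", sample)
print(swapped)
```

So in this engine the `|` only needs escaping on the "Find" side, never on the "Replace" side.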

BUILD LOGS: HELIOS - Latest Update: 2015-SEP-06 ::: ZEUS - BOTW 2013-JUN-28 ::: APOLLO - Complete: 2014-MAY-10
OTHER STUFF: Cable Lacing Tutorial ::: What Is ZFS? ::: mincss Primer ::: LSI RAID Card Flashing Tutorial
FORUM INFO: Community Standards ::: The Moderating Team ::: 10TB+ Storage Showoff Topic


Yeah, I know how to find the string I want. It's just that I would like to copy it, the way you can in Word: when you search for a word, it highlights it, ready for copying or cutting. Notepad++ only lets you copy or cut the whole line that the string you're looking for is in.


Hm. What exactly happens when you do the above search and replace operation? I

currently don't have access to a Windows machine so I can't try it out myself,

but if all else fails I will have one tomorrow where I could play around with it.

Could you maybe post the source code for the page you're trying to parse? That

would allow me to tinker around with it.


Ok, just as an example: what you're looking for is a new list looking something like this:

Suzuhara Misaki | 鈴原 みさき
Mihara Ichirō | 三原 一郎
Kobayashi Hatoko | 小林 鳩子
etc.

(I know it would look better like this:

Suzuhara Misaki  | 鈴原 みさき
Mihara Ichirō    | 三原 一郎
Kobayashi Hatoko | 小林 鳩子
etc.

but that will be very tricky I suspect.)


Ok I'll give it a shot tomorrow.


Ok, I think I got it.

 

Go to "Search" (Ctrl-F).

Go to the "Mark" tab.

Into "Find what", enter this regex:


^.*?xml:lang="ja">(.*?)</span><span class="t_nihongo_comma" style="display:none">,</span> <i>(.*?)</i>.*?$
Check the "Bookmark line" checkbox.

Select "Regular expression" in "Search Mode".

Click "Mark All".

A dialog box should pop up saying "X matches.".

Go to the main menu: "Search"->"Bookmark"->"Remove Unmarked Lines".

Your document should now be reduced to the lines you're interested in.

Go to "Replace" (Ctrl-H).

Into "Find what", enter the same regex as above.

EDIT: Insert the following into "Replace with":

\2 | \1
/EDIT

Select "Regular expression" as your search mode.

Click "Replace All".

A dialog box should pop up saying "X occurrences replaced."

 

When I do this with the source text of the following page: List of Angelic Layer characters (http://en.wikipedia.org/wiki/List_of_Angelic_Layer_characters)

 

I get the following result:

 

Suzuhara Misaki | 鈴原 みさき
Mihara Ichirō | 三原 一郎
Kobayashi Hatoko | 小林 鳩子
Kobayashi Kōtarō | 小林 虎太郎
Kizaki Tamayo | 木崎 珠代
Mihara Ōjirō | 三原 王二郎
Asami Shōko | 浅見 祥子
Suzuhara Shūko | 鈴原 萩子
Jōnouchi Sai | 城乃内 最
Saitō Kaede | 斉藤 楓
Seto Ringo | 瀬戸 林子
Fujisaki Madoka | 藤崎 円香
Fujisaki Arisu | 藤崎 有栖
Ogata Masaharu | 尾形 雅治
Fujimori Hiromi | 藤森 ひろみ
Inada Yūko | 稲田 夕子
Inada Shūji | 稲田 修二
Shikaisha | 司会者
Kyōko | 京子
Kitamura Asuka | 北村 飛鳥
Hikawa Yūko | 氷川 優子
Yamada Tomoko | 山田 知子
Shibata Maria | 柴田 まりあ
Misaki Ryō | 岬 了
Jōnouchi Rin | 城乃内 鈴
Tanaka Chitose | 田中 千歳
Tsubasa Makkenjī | つばさ・マッケンジー
Hikaru | ヒカル
Suzuka | 鈴鹿
Shirahime | 白姫
Buranshe | ブランシェ
Ranga | ランガ
Wizādo | ウィザード
Atena | アテナ
Mao | 猫
 

What you hadn't taken into account with your original regex were the characters from the line's beginning

to the pattern's beginning and from the pattern's end to the line's end.
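If you ever want to do the same thing outside NP++, the steps above can be sketched roughly like this in Python. This is only a sketch under assumptions: the function name `extract_names` and the two-line sample input are made up for illustration, and the regex is the one from the steps, copied unchanged.

```python
import re

# The regex from the steps above, split over several lines for readability.
PATTERN = re.compile(
    r'^.*?xml:lang="ja">(.*?)</span>'
    r'<span class="t_nihongo_comma" style="display:none">,</span> '
    r'<i>(.*?)</i>.*?$'
)

def extract_names(html_source):
    """Keep only matching lines and rewrite each as 'English | Japanese'."""
    result = []
    for line in html_source.splitlines():
        match = PATTERN.match(line)
        if match:  # non-matching lines are dropped, like "Remove Unmarked Lines"
            japanese, english = match.group(1), match.group(2)
            result.append(f"{english} | {japanese}")
    return result

# Hypothetical input modelled on the Wikipedia markup (not the real page source).
sample = (
    '<p>Intro text without any names</p>\n'
    '<dt>(<span class="t_nihongo_kanji" lang="ja" xml:lang="ja">鈴原 みさき</span>'
    '<span class="t_nihongo_comma" style="display:none">,</span> '
    '<i>Suzuhara Misaki</i>)</dt>'
)
names = extract_names(sample)
print("\n".join(names))
```

The leading `^.*?` and trailing `.*?$` do the same job here as in NP++: they swallow the rest of the line so that only the two captured names survive.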

 

I have looked into padding the list to make it look better, but I don't think you can do that with just regex in NP++. If you want to experiment a bit, this looks promising: http://stackoverflow.com/questions/14878571/add-trailing-zeroes-to-line-in-notepad
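Outside NP++ the padding itself is not hard. Here is a minimal Python sketch, assuming the list is already in the "English | Japanese" form from above; the helper name `align` is made up. It pads the English column to the length of the longest English name.

```python
def align(rows):
    """Pad the English column so that all | separators line up."""
    pairs = [row.split(" | ", 1) for row in rows]
    width = max(len(english) for english, _ in pairs)
    return [f"{english.ljust(width)} | {japanese}" for english, japanese in pairs]

rows = [
    "Suzuhara Misaki | 鈴原 みさき",
    "Mihara Ichirō | 三原 一郎",
    "Kobayashi Hatoko | 小林 鳩子",
]
aligned = align(rows)
print("\n".join(aligned))
```

The columns only line up visually in a monospaced font, and the count is per character, so names written with combining accents could still come out slightly off.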

 

I hope this works for you and is what you need, otherwise feel free to ask more. I won't have access to

the Windows machine (and therefore NP++) tomorrow, but I can try again on Saturday.


Ah crap, *facepalm*, sorry! Yes, of course, insert the following into the "Replace with" box:

\2 | \1


"\1" | "\2"
The \1 holds the Japanese version, and the \2 holds the English version. When you look

at the regular expression:

 

^.*?xml:lang="ja">(.*?)</span><span class="t_nihongo_comma" style="display:none">,</span> <i>(.*?)</i>.*?$
Notice the (.*?) where the Japanese and English matches are. That means that anything

which matches for that part of the regex will be stored in a numbered variable that can

be used with \1, \2, \3 etc. They are numbered in the order they appear in the original

regex, and since the Japanese version comes first, that is stored in \1, and the English

version goes into \2.

Note that in many other environments (for example when the expression sits inside a double-quoted string in a programming language or on a command line), the " would need to be escaped like so:

\"\1\" | \"\2\"
But NP++ apparently doesn't need this. Just keep it in mind: if you ever need to use a regex in another program and it doesn't work, an unescaped character is often the source of the error.
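To make the group numbering concrete, here is a small Python sketch. It uses Python's `re` module, which numbers groups the same way, but it is not NP++ itself, and the sample line is a made-up fragment modelled on the Wikipedia markup.

```python
import re

regex = (
    r'^.*?xml:lang="ja">(.*?)</span>'
    r'<span class="t_nihongo_comma" style="display:none">,</span> '
    r'<i>(.*?)</i>.*?$'
)

# Hypothetical sample line, not copied from the real page.
line = (
    '(<span class="t_nihongo_kanji" lang="ja" xml:lang="ja">三原 一郎</span>'
    '<span class="t_nihongo_comma" style="display:none">,</span> '
    '<i>Mihara Ichirō</i>)'
)

match = re.match(regex, line)
print(match.group(1))  # first (.*?) in the regex: the Japanese name
print(match.group(2))  # second (.*?) in the regex: the English name

# The order in the replacement decides the order in the output.
jp_first = re.sub(regex, r'"\1" | "\2"', line)
eng_first = re.sub(regex, r'"\2" | "\1"', line)
print(jp_first)
print(eng_first)
```

The Python strings are written with single quotes so the `"` characters inside need no escaping, which is the same escaping issue in a different place.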


Happy to help. :)


You mean to get this:

Suzuhara Misaki |鈴原 みさき
instead of this:
Suzuhara Misaki | 鈴原 みさき
?

Simply remove the space from the replacement expression:

"\1" |"\2"
Spaces are translated 1:1 from the "Replace with" expression into the final result.

Or are you talking about removing the spaces from within the JP expressions, so

this:

Mihara Ōjirō | 三原王二郎
instead of this:
Mihara Ōjirō | 三原 王二郎
I think that one would be quite a bit trickier, but I could look into it tomorrow.

I'm not sure if my regex-fu is strong enough for that but I'll give it a shot if

that's the desired result.
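For the second variant (no spaces inside the Japanese names), one option outside NP++ would be a second pass over the finished list. This is a minimal Python sketch, assuming every row already has the form "English | Japanese"; the helper name `strip_jp_spaces` is made up.

```python
import re

def strip_jp_spaces(row):
    """Remove whitespace from the Japanese half only; the English half is untouched."""
    english, japanese = row.split(" | ", 1)
    compact = re.sub(r"\s+", "", japanese)
    return f"{english} | {compact}"

result = strip_jp_spaces("Mihara Ōjirō | 三原 王二郎")
print(result)
```

Splitting on the first " | " keeps the space between the English given name and family name intact.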


Or are you talking about removing the spaces from within the JP expressions, so
this:
Mihara Ōjirō | 三原王二郎
instead of this:
Mihara Ōjirō | 三原 王二郎
I think that one would be quite a bit trickier, but I could look into it tomorrow.

I'm not sure if my regex-fu is strong enough for that but I'll give it a shot if

that's the desired result.

 

This is what I'm after. If it makes things easier, the end file will be a .txt.


Ok I'll have a look and get back to you.


Okay, I'm trying something new, as I want to get the ENG from a different part of the line. I tried to copy what you did with the last one, but it doesn't seem to work for me (the site: http://en.wikipedia.org/wiki/List_of_Strike_Witches_characters).

Original line

<dt><span id="Yoshika_Miyafuji">Yoshika Miyafuji</span> <span style="font-weight: normal">(<span class="t_nihongo_kanji" lang="ja" xml:lang="ja">宮藤 芳佳</span><span class="t_nihongo_comma" style="display:none">,</span> <i>Miyafuji Yoshika</i><span class="t_nihongo_help noprint"><sup><a href="/wiki/Help:Installing_Japanese_character_sets" title="Help:Installing Japanese character sets"><span class="t_nihongo_icon" style="color: #00e; font: bold 80% sans-serif; text-decoration: none; padding: 0 .1em;">?</span></a></sup></span>)</span></dt>

 

my attempt

^.*?span id="Yoshika_Miyafuji">(.*?)</span> <span style="font-weight: normal">(<span class="t_nihongo_kanji" lang="ja" xml:lang="ja">(.*?)</span>.*?$

What am I doing wrong?


I am terribly sorry about not getting back sooner; my weekend turned into quite a jumbly

mess (the good kind, but very hectic and exhausting).

Anyway:

 

Found out how to do it by this

the fixes

replace this: \s+(?=\w+":")
with this: $1$2$3

That's great!

I will take a look at your new problem later today, promise! ;)

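EDIT: Just noticed this while having a quick look at it: you have a literal ( in your regex (the one right after normal">). An unescaped ( is a special character that opens a capture group, so to match the character itself it needs to be escaped. Try this: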

 

^.*?span id="Yoshika_Miyafuji">(.*?)</span> <span style="font-weight: normal">\(<span class="t_nihongo_kanji" lang="ja" xml:lang="ja">(.*?)</span>.*?$
When I try this on your sample line, I get:
"Yoshika Miyafuji" | "宮藤 芳佳"

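To show what the unescaped ( does, here is a short Python sketch. Python's `re` module may report the problem differently than NP++, and the patterns are shortened fragments of the regex above rather than the full thing.

```python
import re

# Shortened fragments of the regex above, for illustration only.
unescaped = r'normal">(<span class="t_nihongo_kanji"'
escaped = r'normal">\(<span class="t_nihongo_kanji"'

try:
    re.compile(unescaped)
    compiled_ok = True
except re.error as exc:
    # The lone ( opens a group that is never closed.
    compiled_ok = False
    print("unescaped pattern rejected:", exc)

text = '<span style="font-weight: normal">(<span class="t_nihongo_kanji" lang="ja">'
found = re.search(escaped, text) is not None
print(found)

# re.escape adds the backslash for you when a fragment should be taken literally.
print(re.escape("("))
```

In the full regex the stray ( would also have shifted the group numbers, which is another reason to escape every bracket that is meant literally.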