Help parsing HTML into a termbase
Thread poster: Patricia Martin
So I found 2 Japanese standards documents online that will come in handy for my next job.
JIS Z 4001
and
JIS Z 8103
My basic copy-and-paste skills in Excel and Word are producing really nasty formatting errors and line breaks. I also don't see any clear delimiters or anything else I can use to clean this up with my current computer knowledge.
I'm wondering if there are any tricks I can use on the HTML to parse these terms into a clean Excel file (.xls) that I can later turn into a termbase.
I would like to at least get the corresponding terms in both languages.
Japanese (用語) and English (対応英語)
The definitions (定義) would be a welcome bonus as would the numbers (番号) in Z8103 and JIS/ISO numbers in Z4001.
Thanks in advance!

Michael Beijer, United Kingdom | Hi Patricia, | Aug 21, 2013
If I look here (http://kikakurui.com/z8/index.html), for example, it looks like the English and Japanese appear on alternate lines.
I would therefore copy/paste the text into 2 different text files and use some sort of trick in a text editor (regex or built in tool) to delete every other line. By doing this to each file, but starting one line down in the second file, you can create two files: (1) in English, (2) in Japanese. Then copy/paste them both into two columns in Excel. Et voilà!
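For anyone who would rather script this than do it in a text editor, the alternate-line split described above could be sketched in Python roughly as follows. This is only a sketch: the sample text is made up for illustration (it is not copied from the site), the function names are hypothetical, and it assumes the text really does alternate strictly between the two languages.

```python
import csv
import io


def split_alternating(text):
    """Pair up alternating lines: lines 1, 3, 5... go left; lines 2, 4, 6... go right."""
    # Drop blank lines first so stray empty lines do not shift the pairing.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    # zip() silently drops a trailing unpaired line, so check the line count
    # yourself if the input might be incomplete.
    return list(zip(lines[0::2], lines[1::2]))


def to_tsv(pairs):
    """Serialize the pairs as tab-separated text that Excel can open or paste."""
    buf = io.StringIO()
    csv.writer(buf, delimiter="\t", lineterminator="\n").writerows(pairs)
    return buf.getvalue()


# Illustrative sample only; which language comes first is an assumption.
sample = (
    "計測用語\n"
    "Glossary of terms used in measurement\n"
    "原子力用語\n"
    "Glossary of terms used in nuclear energy\n"
)
pairs = split_alternating(sample)
print(to_tsv(pairs))
```

Saving the printed text with a .tsv or .txt extension and opening it in Excel should give two columns, one per language.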
------------------------------------------*
See e.g.:
http://stackoverflow.com/questions/17735289/delete-every-other-line-in-notepad
or:
'In Notepad ++:
1. Using Replace (ctrl-h)
2. Find what:^.*0 \..*$
3. Replace with:(leave empty)
4. Search mode: Check the regular expression radio button' (http://portableapps.com/node/15300 )
[Edited at 2013-08-21 20:40 GMT]
------------------------------------------*
You could also use HTTrack to download all of the relevant .html pages to your computer: http://www.httrack.com/
------------------------------------------*
Michael
[Edited at 2013-08-21 20:45 GMT]

That's handy, but it doesn't solve my problem. | Aug 22, 2013
Thanks Michael.
That helps with getting the titles of the standards into a nice neat excel file and I learned a lot in the process, but it doesn't solve my problem.
Some of those standards are lists of terminology. (The two I listed in my first post are what I really need.)
Copy/paste just doesn't work.
I think I'm getting close by saving the html file, opening it in Adobe Acrobat and playing around with conversions to different formats, but I'm not there yet. (Why didn't I major in Computer Science?)
If you look at this picture you can see what I need.
The only thing I can think of would be to use the Text to Columns command in excel, either by using a delimiter, or .... since the spaces are in the same place I remember there was an option to physically click the location to separate into columns using the mouse. I JUST CAN'T FIGURE OUT HOW TO GET THE DATA AS IT APPEARS ON THE WEB PAGE INTO A CLEAN FORMAT IN EXCEL OR WORD.
I should be able to figure it out eventually, and I'm so close it's driving me crazy. Hopefully somebody can help me save my sanity. If not I'll post what I went through to finally solve this.

Terrible formatting | Aug 22, 2013
Patricia Martin wrote:
Copy/paste just doesn't work.
I think I'm getting close by saving the html file, opening it in Adobe Acrobat and playing around with conversions to different formats, but I'm not there yet. (Why didn't I major in Computer Science?)
Oh dear… The HTML there is hideous: There is no table; everything that looks like a table cell is actually a carefully formatted paragraph.
Let me try spending maybe 15 minutes to half an hour and see what I can come up with.
PS: I’m now at the 15 minute mark: Those aren’t even paragraphs. There are single characters that are individually positioned. It’s going to be a huge amount of work trying to make sense of the text.
PPS: I’m now at the 30 minute mark. I think I’m going to give up.
[Edited at 2013-08-22 03:44 GMT]
[Edited at 2013-08-22 04:01 GMT]

Thanks for trying. | Aug 22, 2013
Thanks for even taking the time to do that for an internet stranger, it really is appreciated.
Would going analog be a solution??
If I physically print this out, manually draw some lines for separation, then try scanning it in with an OCR program, would that work?
Is there an OCR program or function that works like the Text to Columns command in Excel I mentioned earlier, so that I could distinguish between columns?

Tony M, France | Utilities for stripping HTML tags | Aug 22, 2013
I'm arriving a bit after the party, but I seem to recall having read somewhere in one of these forums that there are utilities available for stripping out unwanted HTML tags; I'd have thought that would probably be a good place to start; afterwards, formatting it into your columns etc. ought to be less of a hassle.
Once the tags are stripped out, are there any other consistent 'separators' that will indicate where to break your terms? I was thinking of things like X number of spaces, etc. It would then be relatively easy to search for that, and replace it with something like a tab character to enable you to then do a table conversion.
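As a rough illustration of that idea, here is a minimal Python sketch that strips the tags and then turns runs of spaces into tabs. The input line is hypothetical, the helper names are made up, and the threshold of two spaces is just a guess; it only helps if the stripped text really does contain consistent runs of spaces between columns.

```python
import re
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect only the text nodes of an HTML fragment, discarding all tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_tags(html):
    """Return the text content of `html` with every tag removed."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


def spaces_to_tabs(line, min_spaces=2):
    """Replace each run of at least `min_spaces` spaces with a single tab."""
    return re.sub(r" {%d,}" % min_spaces, "\t", line)


# Hypothetical input line, not the site's actual markup.
html = "<p>核物質   nuclear material</p>"
row = spaces_to_tabs(strip_tags(html))
print(row.split("\t"))
```

The single space inside "nuclear material" is left alone because it is shorter than the threshold, which is the point of searching for "X number of spaces" rather than any space.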
Yes, just did a quick Google on "strip html tags", and there appears to be a lot of information out there that might help you, including an on-line resource, I noticed, which might be good for a quick try-out.

That might work | Aug 22, 2013
Yes, that might work.
I wonder if they did this on purpose, since on the web site’s front page there seems to be a note about not allowing people to download the content. But don’t they have disability legislation in Japan? This kind of hideous HTML is going to drive blind people crazy.

Stripping HTML | Aug 22, 2013
Tony M wrote:
I'm arriving a bit after the party, but I seem to recall having read somewhere in one of these forums that there are utilities available for stripping out unwanted HTML tags; I'd have thought that would probably be a good place to start; afterwards, formatting it into your columns etc. ought to be less of a hassle.
Once the tags are stripped out, are there any other consistent 'separators' that will indicate where to break your terms? I was thinking of things like X number of spaces, etc. It would then be relatively easy to search for that, and replace it with something like a tab character to enable you to then do a table conversion.
After my 30-minute attempt I can conclusively answer your question with a “No”. Those are not properly constructed web pages. What you see on the page are bits and pieces of the content (which can be as small as a single character) pasted randomly in place to make it visually look like a table. There are no separators. (Or rather, there are too many separators and they aren’t logical.) If you just strip out the HTML, you will just end up with a jumbled mess.
What needs to be done, if you go the “interpret this HTML and get a text file” route, is to read in the HTML, pattern-match the HTML markup and try to reconstruct the table. Once you can reconstruct the table (probably half a day of work at least), then you can think about exporting to CSV.
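To give an idea of what that reconstruction might involve, here is a minimal Python sketch. Everything about the markup is an assumption: it pretends each fragment sits in a tag with an inline style such as `left:100px; top:50px`, and the column boundaries in `COLUMN_STARTS` are invented. The real pages would need their own patterns, and probably some tolerance for fragments whose `top` values differ by a pixel or two.

```python
import csv
import io
import re
from html.parser import HTMLParser

# Hypothetical x-positions (px) at which each table column starts.
COLUMN_STARTS = [0, 100, 300]


class PositionedText(HTMLParser):
    """Collect (top, left, text) for text inside tags that carry a position."""

    def __init__(self):
        super().__init__()
        self.pos = None  # (top, left) of the tag currently open, if any
        self.fragments = []

    def handle_starttag(self, tag, attrs):
        style = dict(attrs).get("style") or ""
        left = re.search(r"left:\s*(\d+)px", style)
        top = re.search(r"top:\s*(\d+)px", style)
        self.pos = (int(top.group(1)), int(left.group(1))) if left and top else None

    def handle_endtag(self, tag):
        self.pos = None

    def handle_data(self, data):
        if self.pos is not None:
            self.fragments.append((self.pos[0], self.pos[1], data))


def column_of(left):
    """Index of the last column whose start is at or before `left`."""
    return max(i for i, start in enumerate(COLUMN_STARTS) if start <= left)


def rebuild_rows(html):
    """Group fragments by `top` into rows and by `left` into columns."""
    parser = PositionedText()
    parser.feed(html)
    parser.close()
    rows = {}
    # Sorting by (top, left) puts the fragments in reading order.
    for top, left, text in sorted(parser.fragments):
        cells = rows.setdefault(top, [""] * len(COLUMN_STARTS))
        cells[column_of(left)] += text
    return [rows[top] for top in sorted(rows)]


def to_csv(rows):
    """Serialize the rebuilt rows as CSV text."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


# Made-up markup for illustration; the site's real HTML will differ.
sample = (
    '<span style="position:absolute; left:10px; top:50px">10009</span>'
    '<span style="position:absolute; left:100px; top:50px">核</span>'
    '<span style="position:absolute; left:120px; top:50px">物質</span>'
    '<span style="position:absolute; left:300px; top:50px">nuclear </span>'
    '<span style="position:absolute; left:360px; top:50px">material</span>'
)
rows = rebuild_rows(sample)
print(to_csv(rows))
```

The hard part is not this code but finding the right patterns and column boundaries for the actual pages, which is where the half day of work would go.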
PS: As I mentioned above, they might have done this on purpose, basically their way of saying “Don’t do it, but I know you will try anyway, so I’ll make this impossible for you to do.”
[Edited at 2013-08-22 07:29 GMT]

Natron, Japan | I don't think there's an easy way for you to automate this. | Aug 22, 2013
I do think this site might be of dubious legality. Maybe ask the client if they have a digital copy of these standards?
You can always just search the site and start making the termbase from scratch. Probably easier than trying to OCR however many sheets this would come out to be.
Props to anyone who could solve this.
[Edited at 2013-08-22 07:55 GMT]

Michael Beijer, United Kingdom | Web data extraction tools & tricks | Aug 22, 2013
Patricia Martin wrote:
Thanks Michael.
That helps with getting the titles of the standards into a nice neat excel file and I learned a lot in the process, but it doesn't solve my problem.
Some of those standards are lists of terminology. (The two I listed in my first post are what I really need.)
Copy/paste just doesn't work.
I think I'm getting close by saving the html file, opening it in Adobe Acrobat and playing around with conversions to different formats, but I'm not there yet. (Why didn't I major in Computer Science?)
If you look at this picture you can see what I need.
The only thing I can think of would be to use the Text to Columns command in excel, either by using a delimiter, or .... since the spaces are in the same place I remember there was an option to physically click the location to separate into columns using the mouse. I JUST CAN'T FIGURE OUT HOW TO GET THE DATA AS IT APPEARS ON THE WEB PAGE INTO A CLEAN FORMAT IN EXCEL OR WORD.
I should be able to figure it out eventually, and I'm so close it's driving me crazy. Hopefully somebody can help me save my sanity. If not I'll post what I went through to finally solve this.
Hi Patricia,
Exactly which pages are you trying to extract the data from? In your first post you mentioned:
http://kikakurui.com/z8/Z8103-2000-01.html + http://kikakurui.com/z4/Z4001-1999-02.html
However, these look nothing like the picture you then showed us in your following post: http://imgur.com/JZTAX6K
Incidentally, for playing around with Excel files and all kinds of magical things with conversions and columns and rows, etc., I can highly recommend: http://www.asap-utilities.com/
Let me know exactly which html pages you are talking about and I will see what I can do.
You might be a mere internet stranger, but you happen to have asked me something that I (a) spend a lot of time doing, and (b) enjoy. See my website for examples of what you can achieve with various data extraction tools and tricks: http://wordbook.nl/wordbook.html + http://wordbook.nl/content/
Michael

Client has physical, but not digital copies. | Aug 23, 2013
I checked with the client and they have the physical copies of these standards and offered to send them, but I think scanning all those pages would take way too much time. They also said digital copies would probably cost extra, but I'm not sure if that's true or not. I still think there's probably a way I could kind of do a "digital OCR" job after converting the html to pdf in acrobat, but I'm not sure on the best way to go about this.
Michael, the picture I uploaded is about the 5th page down from JIS Z 4001. That's where the list of terms starts, and it goes all the way down to page 204. There are also terms in the following appendix A, from page 205 to 216.
JIS Z 8103 is shorter and the terms go from page 2 to 17.

Michael Beijer, United Kingdom | Looking for coder/regex wizard, preferably fluent in English and Japanese. | Aug 23, 2013
Hi Patricia,
I understand now. I had missed the fact that there was actually a scroll bar.
Hmm. I really wish I could read Japanese because this doesn't look as impossible as it might seem. I downloaded the HTML file to my computer and it looks like this:
It really ought to be possible with a few text editing tricks. Remove some HTML here and there. Delete every other line, or every 3rd/4th line. Copy/paste a bit and maybe do some moving around in Excel (with ASAP Utilities), etc. and it ought to be possible.
I think you need someone who (1) speaks Japanese, (2) is good at text editing, regex, etc.
Perhaps ask over at the memoQ, DVX, or CafeTran mailing lists:
http://tech.groups.yahoo.com/group/memoQ/
http://tech.groups.yahoo.com/group/dejavu-l/
https://groups.google.com/forum/?fromgroups=#!forum/cafetranslators
I bet there is someone there that will be able to do this.
Michael

Rolf Keller, Germany | A Word macro should do the job | Aug 24, 2013
Ambrose Li wrote:
Those aren’t even paragraphs. There are single characters that are individually positioned.
I had a quick look at the HTML code of the first document (JIS Z 4001). It seems that each entry consists of a few simple HTML paragraphs. Example (ProZ hides the HTML codes here):
10009 830
核物質
原料物質及び特殊核分裂性物質の総称。まれには 鉱石及び鉱石廃棄物もいう。
備考 我が国では,鉱石及び鉱石廃棄物は含め
ない。
nuclear material
Provided that the entries as such are separable (e.g. with the help of the entry numbers, 10009 in the example), it is simple to write an MS Word macro that creates a clean table. The HTML text must be imported into Word as plain text, so that the macro can discern the HTML elements. But of course, "simple" doesn't mean "within a few minutes". And one needs to be able to read Japanese, so I can't do the job, sorry.
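The same entry-splitting idea could also be sketched in Python instead of a Word macro (this is a swapped-in technique, not the macro described above). The sketch assumes that every entry looks like the single example quoted here: a line starting with a five-digit entry number, then the Japanese term, then one or more definition lines, then the English term on the last line. Real entries may well deviate from that, and the function names are hypothetical.

```python
import re

# Assumption: an entry starts with a line that begins with a five-digit number.
ENTRY_START = re.compile(r"^\d{5}\b")


def split_entries(lines):
    """Group lines into entries; a new entry begins at each entry-number line."""
    entries, current = [], None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if ENTRY_START.match(line):
            current = [line]
            entries.append(current)
        elif current is not None:
            current.append(line)  # lines before the first number are ignored
    return entries


def to_row(entry):
    """Map one entry to (number, Japanese term, definition, English term)."""
    if len(entry) < 3:
        raise ValueError("entry does not match the assumed layout: %r" % entry)
    number, term, *middle, english = entry
    # Wrapped Japanese lines are joined without a separator.
    return (number, term, "".join(middle), english)


# The entry quoted in the post above, one paragraph per line.
example = """\
10009 830
核物質
原料物質及び特殊核分裂性物質の総称。まれには 鉱石及び鉱石廃棄物もいう。
備考 我が国では,鉱石及び鉱石廃棄物は含め
ない。
nuclear material
"""
table = [to_row(entry) for entry in split_entries(example.splitlines())]
for row in table:
    print("\t".join(row))
```

Each printed line is tab-separated, so it can be pasted into Excel as four columns (番号, 用語, 定義, 対応英語). Someone who reads Japanese would still need to check the results, for example for definition lines that happen to start with digits.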