Help parsing HTML into a termbase (General technical issues)

Thread poster: Patricia Martin

Patricia Martin
United States
Japanese to English
+ ...

Aug 21, 2013

So I found 2 Japanese standards documents online that will come in handy for my next job.

JIS Z 4001

and

JIS Z 8103

Copying and pasting into Excel and Word, which is about the limit of my skills, produces really nasty formatting errors and line breaks. I also don't see any clear delimiters, or anything else I could use to clean this up with my current computer knowledge.

I'm wondering if there are any tricks I can use on the HTML to parse these terms into a clean Excel .xls file that I can later turn into a termbase?

I would like to at least get the corresponding terms in both languages.

Japanese (用語) and English (対応英語)

The definitions (定義) would be a welcome bonus as would the numbers (番号) in Z8103 and JIS/ISO numbers in Z4001.

Thanks in advance!

Michael Beijer

United Kingdom
Local time: 21:38
Member (2009)
Dutch to English
+ ...

Hi Patricia,

Aug 21, 2013

If I look here (http://kikakurui.com/z8/index.html), e.g., it looks like the English and Japanese appear on alternate lines.

I would therefore copy/paste the text into 2 different text files and use some sort of trick in a text editor (regex or built in tool) to delete every other line. By doing this to each file, but starting one line down in the second file, you can create two files: (1) in English, (2) in Japanese. Then copy/paste them both into two columns in Excel. Et voilà!
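The alternate-line split described above can also be scripted instead of done by hand in a text editor. What follows is only a minimal sketch in Python, assuming the pasted text really does alternate strictly between the two languages. The function names and the sample strings are made up for illustration and are not taken from the site.

```python
import csv
import io

def pair_alternate_lines(text):
    """Pair line 1 with line 2, line 3 with line 4, and so on."""
    # Drop blank lines first so stray empty lines do not shift the pairing.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    # Even indexes form one column, odd indexes the other.
    return list(zip(lines[0::2], lines[1::2]))

def to_tsv(pairs):
    """Render the pairs as tab-separated text that Excel can open or paste."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerows(pairs)
    return buf.getvalue()

# Hypothetical sample: Japanese and English on alternating lines.
sample = "用語A\nterm A\n\n用語B\nterm B\n"
pairs = pair_alternate_lines(sample)
print(to_tsv(pairs))
```

If a single entry is missing one of its two lines, every pair after it will be misaligned, so it is worth spot-checking the result in a few places before importing it.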

------------------------------------------*
See e.g.:

http://stackoverflow.com/questions/17735289/delete-every-other-line-in-notepad

or:

'In Notepad ++:

1. Using Replace (ctrl-h)
2. Find what:^.*0 \..*$
3. Replace with:(leave empty)
4. Search mode: Check the regular expression radio button' (http://portableapps.com/node/15300 )

[Edited at 2013-08-21 20:40 GMT]

------------------------------------------*
You could also use HTTrack to download all of the relevant .html pages to your computer: http://www.httrack.com/
------------------------------------------*

Michael

[Edited at 2013-08-21 20:45 GMT]

Patricia Martin
United States
Japanese to English
+ ...

TOPIC STARTER

That's handy, but it doesn't solve my problem.

Aug 22, 2013

Thanks Michael.

That helps with getting the titles of the standards into a nice neat excel file and I learned a lot in the process, but it doesn't solve my problem.

Some of those standards are lists of terminology. (The two I listed in my first post are what I really need.)

Copy/paste just doesn't work.

I think I'm getting close by saving the html file, opening it in Adobe Acrobat and playing around with conversions to different formats, but I'm not there yet. (Why didn't I major in Computer Science?)

If you look at this picture you can see what I need.

The only thing I can think of would be to use the Text to Columns command in excel, either by using a delimiter, or .... since the spaces are in the same place I remember there was an option to physically click the location to separate into columns using the mouse. I JUST CAN'T FIGURE OUT HOW TO GET THE DATA AS IT APPEARS ON THE WEB PAGE INTO A CLEAN FORMAT IN EXCEL OR WORD.

I should be able to figure it out eventually, and I'm so close it's driving me crazy. Hopefully somebody can help me save my sanity. If not I'll post what I went through to finally solve this.

Ambrose Li

Canada
Local time: 16:38
English
+ ...

Terrible formatting

Aug 22, 2013

Patricia Martin wrote:

Copy/paste just doesn't work.

I think I'm getting close by saving the html file, opening it in Adobe Acrobat and playing around with conversions to different formats, but I'm not there yet. (Why didn't I major in Computer Science?)

Oh dear… The HTML there is hideous: There is no table; everything that looks like a table cell is actually a carefully formatted paragraph.

Let me try spending maybe 15 minutes to half an hour and see what I can come up with.

PS: I’m now at the 15 minute mark: Those aren’t even paragraphs. There are single characters that are individually positioned. It’s going to be a huge amount of work trying to make sense of the text.

PPS: I’m now at the 30 minute mark. I think I’m going to give up.

[Edited at 2013-08-22 03:44 GMT]

[Edited at 2013-08-22 04:01 GMT]

Patricia Martin
United States
Japanese to English
+ ...

TOPIC STARTER

Thanks for trying.

Aug 22, 2013

Thanks for even taking the time to do that for an internet stranger, it really is appreciated.

Would going analog be a solution??

If I physically print this out, manually draw some lines for separation, then try scanning it in with an OCR program, would that work?

Is there an OCR program/function that works similar to my previously mentioned Text to Columns excel command where I could distinguish between columns?

Tony M
France
Local time: 22:38
Member
French to English
+ ...

SITE LOCALIZER

Utilities for stripping HTML tags

Aug 22, 2013

I'm arriving a bit after the party, but I seem to recall having read somewhere in one of these forums that there are utilities available for stripping out unwanted HTML tags; I'd have thought that would probably be a good place to start; afterwards, formatting it into your columns etc. ought to be less of a hassle.

Once the tags are stripped out, are there any other consistent 'separators' that will indicate where to break your terms? I was thinking of things like X number of spaces, etc. It would then be relatively easy to search for that, and replace it with something like a tab character to enable you to then do a table conversion.

Yes, just did a quick Google on "strip html tags", and there appears to be a lot of information out there that might help you, including an on-line resource, I noticed, which might be good for a quick try-out.
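The two steps suggested here (strip the tags, then turn a consistent run of spaces into a tab) might look roughly like this in Python. This is a sketch only: it uses the standard library's `html.parser`, the sample HTML is invented, and the threshold of two or more spaces is an assumption that would need checking against the real text. It only helps if the stripped text actually has a consistent separator.

```python
import re
from html.parser import HTMLParser

class TextOnly(HTMLParser):
    """Collect only the text between tags, discarding the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def strip_tags(html):
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)

def spaces_to_tabs(line, min_spaces=2):
    """Replace each run of at least `min_spaces` spaces with one tab."""
    return re.sub(r" {%d,}" % min_spaces, "\t", line.strip())

# Hypothetical sample line; the real pages may look nothing like this.
html = "<p><b>term A</b>   用語A   definition of A</p>"
plain = strip_tags(html)
print(spaces_to_tabs(plain))
```

Single spaces inside a term or definition are left alone, so multi-word English terms survive; the tab-separated result can then be pasted into Excel or converted to a table in Word.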

Ambrose Li

Canada
Local time: 16:38
English
+ ...

That might work

Aug 22, 2013

Yes, that might work.

I wonder if they did this on purpose, since on the web site’s front page there seems to be a note about not allowing people to download the content. But don’t they have disability legislation in Japan? This kind of hideous HTML is going to drive blind people crazy.

Ambrose Li

Canada
Local time: 16:38
English
+ ...

Stripping HTML

Aug 22, 2013

Tony M wrote:

I'm arriving a bit after the party, but I seem to recall having read somewhere in one of these forums that there are utilities available for stripping out unwanted HTML tags; I'd have thought that would probably be a good place to start; afterwards, formatting it into your columns etc. ought to be less of a hassle.

Once the tags are stripped out, are there any other consistent 'separators' that will indicate where to break your terms? I was thinking of things like X number of spaces, etc. It would then be relatively easy to search for that, and replace it with something like a tab character to enable you to then do a table conversion.

After my 30-minute attempt I can conclusively answer your question with a “No”. Those are not properly constructed web pages. What you see on the page are bits and pieces of the content (which can be as small as a single character) pasted randomly in place to make it visually look like a table. There are no separators. (Or rather, there are too many separators and they aren’t logical.) If you just strip out the HTML, you will just end up with a jumbled mess.

What needs to be done, if you go the “interpret this HTML and get a text file” route, is to read in the HTML, pattern-match the HTML markup and try to reconstruct the table. Once you can reconstruct the table (probably half a day of work at least), then you can think about exporting to CSV.
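To give an idea of what "pattern-match the markup and reconstruct the table" could involve, here is a rough Python sketch. It rests on loud assumptions: that every fragment is a `<span>` with `left` and `top` pixel offsets in its `style` attribute, and that you can read the starting offset of each column off the page yourself. The real pages may use different tags, attributes, or units, so the regular expression, the function name, and the sample data are all hypothetical.

```python
import csv
import io
import re

# ASSUMPTION: each fragment looks like
#   <span style="left:120px;top:40px">text</span>
FRAGMENT = re.compile(
    r'<span style="left:(\d+)px;top:(\d+)px">([^<]*)</span>'
)

def rebuild_rows(html, column_starts):
    """Group fragments into rows by `top`, then into columns by `left`.

    `column_starts` lists the left offset where each column begins,
    in ascending order.
    """
    rows = {}
    for left, top, text in FRAGMENT.findall(html):
        left, top = int(left), int(top)
        # The column is the last one whose start is at or before `left`.
        col = max(
            (i for i, start in enumerate(column_starts) if start <= left),
            default=0,
        )
        rows.setdefault(top, {}).setdefault(col, []).append((left, text))
    table = []
    for top in sorted(rows):
        cells = []
        for col in range(len(column_starts)):
            # Re-join the fragments of one cell in left-to-right order.
            parts = sorted(rows[top].get(col, []))
            cells.append("".join(text for _, text in parts))
        table.append(cells)
    return table

# Hypothetical fragments, deliberately out of order and split mid-word.
sample = (
    '<span style="left:10px;top:40px">用</span>'
    '<span style="left:22px;top:40px">語A</span>'
    '<span style="left:200px;top:40px">term A</span>'
    '<span style="left:200px;top:60px">term B</span>'
    '<span style="left:10px;top:60px">用語B</span>'
)
table = rebuild_rows(sample, column_starts=[0, 200])

buf = io.StringIO()
csv.writer(buf, lineterminator="\n").writerows(table)
print(buf.getvalue())
```

This treats every distinct `top` value as its own row, so a definition that wraps over several visual lines would come out as several rows and would still need merging afterwards. That merging step is likely where most of the half day would go.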

PS: As I mentioned above, they might have done this on purpose, basically their way of saying “Don’t do it, but I know you will try anyway, so I’ll make this impossible for you to do.”

[Edited at 2013-08-22 07:29 GMT]

Natron
Japan
Local time: 06:38
English to Japanese
+ ...

I don't think there's an easy way for you to automate this.

Aug 22, 2013

I do think this site might be of dubious legality; maybe ask the client if they have a digital copy of these standards?

You can always just search the site and start making the termbase from scratch. Probably easier than trying to OCR however many sheets this would come out to be.

Props to anyone who could solve this.

[Edited at 2013-08-22 07:55 GMT]

Michael Beijer

United Kingdom
Local time: 21:38
Member (2009)
Dutch to English
+ ...

Web data extraction tools & tricks

Aug 22, 2013

Patricia Martin wrote:

Thanks Michael.

That helps with getting the titles of the standards into a nice neat excel file and I learned a lot in the process, but it doesn't solve my problem.

Some of those standards are lists of terminology. (The two I listed in my first post are what I really need.)

Copy/paste just doesn't work.

I think I'm getting close by saving the html file, opening it in Adobe Acrobat and playing around with conversions to different formats, but I'm not there yet. (Why didn't I major in Computer Science?)

If you look at this picture you can see what I need.

The only thing I can think of would be to use the Text to Columns command in excel, either by using a delimiter, or .... since the spaces are in the same place I remember there was an option to physically click the location to separate into columns using the mouse. I JUST CAN'T FIGURE OUT HOW TO GET THE DATA AS IT APPEARS ON THE WEB PAGE INTO A CLEAN FORMAT IN EXCEL OR WORD.

I should be able to figure it out eventually, and I'm so close it's driving me crazy. Hopefully somebody can help me save my sanity. If not I'll post what I went through to finally solve this.

Hi Patricia,

Exactly which pages are you trying to extract the data from? In your first post you mentioned:

http://kikakurui.com/z8/Z8103-2000-01.html + http://kikakurui.com/z4/Z4001-1999-02.html

However, these look nothing like the picture you then showed us in your following post: http://imgur.com/JZTAX6K

Incidentally, for playing around with Excel files and all kinds of magical things with conversions and columns and rows, etc., I can highly recommend: http://www.asap-utilities.com/

Let me know exactly which html pages you are talking about and I will see what I can do.

You might be a mere internet stranger, but you happen to have asked me something that I (a) spend a lot of time doing, and (b) enjoy. See my website for examples of what you can achieve with various data extraction tools and tricks: http://wordbook.nl/wordbook.html + http://wordbook.nl/content/

Michael

Patricia Martin
United States
Japanese to English
+ ...

TOPIC STARTER

Client has physical, but not digital copies.

Aug 23, 2013

I checked with the client and they have the physical copies of these standards and offered to send them, but I think scanning all those pages would take way too much time. They also said digital copies would probably cost extra, but I'm not sure if that's true or not. I still think there's probably a way I could do a kind of "digital OCR" job after converting the HTML to PDF in Acrobat, but I'm not sure about the best way to go about this.

Michael, the picture I uploaded is about the 5th page down from JIS Z 4001. That's where the list of terms starts and it goes all the way down to page 204. There are also terms in the following Appendix A, from page 205 to 216.

JIS Z 8103 is shorter and the terms go from page 2 to 17.

Michael Beijer

United Kingdom
Local time: 21:38
Member (2009)
Dutch to English
+ ...

Looking for coder/regex wizard, preferably fluent in English and Japanese.

Aug 23, 2013

Hi Patricia,

I understand now. I had missed the fact that there was actually a scroll bar.

Hmm. I really wish I could read Japanese because this doesn't look as impossible as it might seem. I downloaded the HTML file to my computer and it looks like this:

[Image: screenshot of the downloaded HTML file, hosted on Wordbook.nl]

It really ought to be possible with a few text editing tricks. Remove some HTML here and there. Delete every other line, or every 3rd/4th line. Copy/paste a bit and maybe do some moving around in Excel (with ASAP Utilities), etc. and it ought to be possible.

I think you need someone who (1) speaks Japanese, (2) is good at text editing, regex, etc.

Perhaps ask over at the memoQ, DVX, or CafeTran mailing lists:

http://tech.groups.yahoo.com/group/memoQ/
http://tech.groups.yahoo.com/group/dejavu-l/
https://groups.google.com/forum/?fromgroups=#!forum/cafetranslators

I bet there is someone there that will be able to do this.

Michael

Rolf Keller
Germany
Local time: 22:38
English to German

A Word macro should do the job

Aug 24, 2013

Ambrose Li wrote:

Those aren’t even paragraphs. There are single characters that are individually positioned.

I had a quick look at the HTML code of the first document (JIS Z 4001). It seems that each entry consists of a few simple HTML paragraphs. Example (ProZ hides the HTML codes here):

10009 830
核物質
原料物質及び特殊核分裂性物質の総称。まれには
鉱石及び鉱石廃棄物もいう。
備考我が国では，鉱石及び鉱石廃棄物は含め
ない。
nuclear material

Provided that the entries as such are separable (e.g. with the help of the entry numbers, 10009 in the example), it is simple to write an MS Word macro that creates a clean table. The HTML text must be imported into Word as plain text, so that the macro can discern the HTML elements. But of course, "simple" doesn't mean "within some minutes". And one needs to be able to read Japanese, thus I can't do the job, sorry.
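The same idea can be sketched outside Word. The Python below is not the macro described above, just an illustration of splitting on entry numbers, using the example entry from this post as test data. It assumes that every entry starts with a line beginning with a five-digit number, that the next line is the Japanese term, and that the English term is whatever ASCII-only lines come last. All three assumptions would need checking against the full document, and the function names are invented.

```python
import re

# ASSUMPTION: entry numbers have five digits, as in the example (10009).
ENTRY_START = re.compile(r"^\d{5}\b")

def split_entries(lines):
    """Start a new entry at every line that begins with an entry number."""
    entries = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if ENTRY_START.match(line):
            entries.append([line])
        elif entries:
            entries[-1].append(line)
    return entries

def to_row(entry):
    """Turn one entry into (number, Japanese term, definition, English term)."""
    number = entry[0]
    term = entry[1] if len(entry) > 1 else ""
    rest = list(entry[2:])
    # ASSUMPTION: the English term is the run of ASCII-only lines at the end.
    english = []
    while rest and rest[-1].isascii():
        english.insert(0, rest.pop())
    # Wrapped Japanese lines are joined back together without spaces.
    return number, term, "".join(rest), " ".join(english)

text = """10009 830
核物質
原料物質及び特殊核分裂性物質の総称。まれには
鉱石及び鉱石廃棄物もいう。
備考我が国では，鉱石及び鉱石廃棄物は含め
ない。
nuclear material
"""
rows = [to_row(e) for e in split_entries(text.splitlines())]
for row in rows:
    print("\t".join(row))
```

Each row comes out as number, term, definition, English term, separated by tabs, which pastes straight into four Excel columns. A definition line that happens to start with five digits would wrongly start a new entry, so the output still needs a read-through by someone who knows Japanese.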
