Ryan Slade | 8 Aug 2012 10:24

HTML Parser

Hi


Any recommendations for an HTML parser? I found this answer on Stack Overflow:
http://stackoverflow.com/questions/11080936/extract-links-from-a-web-page-using-go-lang

Anyone here know the status of the standard package or can recommend any alternatives? I'd just like to know my options before choosing one.

Thanks
Ryan

Chris Broadfoot | 8 Aug 2012 10:25

Re: HTML Parser

Check this thread from not too long ago:

https://groups.google.com/forum/?fromgroups#!topic/golang-nuts/3UxkgU_towg


Patrick Mylund Nielsen | 8 Aug 2012 10:34

Re: HTML Parser

exp/html is great: http://tip.golang.org/pkg/exp/html/


Jeremy Wall | 8 Aug 2012 16:59

Re: HTML Parser

I'm the author of go-html-transform, and it mostly works for the stuff
I use it for. If you use it, let me know how it goes :-)

That said, I'm hoping Nigel will finish the exp/html package soon so I
can stop fixing bugs and maintaining h5 and use his instead. I'm not
sure how close to completion the standard package is. Maybe Nigel can
chime in on the thread and let us know.


Andy Balholm | 8 Aug 2012 17:32

Re: HTML Parser

On Wednesday, August 8, 2012 7:59:14 AM UTC-7, Jeremy Wall wrote:

> That said I'm hoping Nigel will finish the exp/html package soon so I
> can stop fixing bugs and maintaining h5 and use his instead. I'm not
> sure how close to completion the standard package is. Maybe Nigel can
> chime in on the thread and let us know.

Nigel and I have been working on getting the exp/html package to pass the
parser test suite. There are only five failing tests left, so it probably
won't be long till it passes. I'm not sure how much more work it will take
after passing the tests before the package is taken out of exp, but it is
quite usable right now, just with no guarantee of API stability.

$ grep -r '^FAIL' testlogs
testlogs/tests_innerHTML_1.dat.log:FAIL "<textarea><option>"
testlogs/tests_innerHTML_1.dat.log:FAIL "</html><!--abc-->"
testlogs/webkit01.dat.log:FAIL "<ul><li><div id='foo'/>A</li><li>B<div>C</div></li></ul>"
testlogs/webkit02.dat.log:FAIL "<html><body><img src=\"\" border=\"0\" alt=\"><div>A</div></body></html>"
testlogs/webkit02.dat.log:FAIL "<isindex action=\"x\">"
 
Jeremy Wall | 8 Aug 2012 18:00

Re: HTML Parser

I look forward to your announcement :-)


Nigel Tao | 9 Aug 2012 02:10

Re: HTML Parser

On 9 August 2012 01:32, Andy Balholm <andybalholm@...> wrote:
> Nigel and I have been working on getting the exp/html package to pass the
> parser test suite. There are only five failing tests left, so it probably
> won't be long till it passes.

I'd like to take this opportunity to thank Andy for all the work he's
done on exp/html. We wouldn't be anywhere near this close to passing
100% of the html5lib / webkit test suite without him.

> I'm not sure how much more work it will take
> after passing the tests before the package is taken out of exp, but it is
> quite usable right now—just no guarantee of API stability.

It is certainly usable right now. Moving out of exp would mean
freezing the API, and I don't think the API is quite right yet.
Specifically, html.Node is currently a struct type; I think it needs
to be an interface type so that programs can provide different
implementations according to their needs. For example, a simple
"scrape the links from this html file" would probably be happy with
the default node implementation. Someone trying to implement a
full-blown browser would probably need nodes to contain fields to
support layout and JavaScript access, but package (exp/)html shouldn't
have to mandate a particular CSS or JavaScript implementation.

There may be other performance-related API changes. The html.Attribute
type should probably use the exp/html/atom mechanism. Namespace
representation might also change.

matt | 8 Mar 2013 14:16

Re: HTML Parser

What happened to exp/html? I can't see it on tip.


--
You received this message because you are subscribed to the Google Groups "golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
For more options, visit https://groups.google.com/groups/opt_out.
 
 
Daniel Morsing | 8 Mar 2013 14:29

Re: HTML Parser

On Fri, Mar 8, 2013 at 2:16 PM, matt <matthew.horsnell@...> wrote:
> What happened to exp/html? I can't see it on the tip?
>

It was moved to the go.net sub-repository

https://code.google.com/p/go/source/browse?repo=net


