Victor Stinner | 13 Jan 02:24 2012

Status of the fix for the hash collision vulnerability

Many people proposed their own ideas for fixing the vulnerability, but only
three wrote a patch:

- Glenn Linderman proposes to fix the vulnerability by adding a new
"safe" dict type (only accepting string keys). His proof-of-concept
(SafeDict.py) uses a secret of 64 random bits and uses it to compute
the hash of a key.
- Marc Andre Lemburg proposes to fix the vulnerability directly in
dict (for any key type). The patch raises an exception if a lookup
causes more than 1000 collisions.
- I propose to fix the vulnerability only in the Unicode hash (not for
other types). My patch adds a random secret initialized at startup (it
can be disabled or fixed using an environment variable).
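
The general shape of such a randomized string hash can be sketched in Python.
This is purely illustrative (the classic multiply-xor string hash with a
random prefix and suffix mixed in), not Victor's actual patch:

```python
import os

# Per-process random secrets, drawn once at startup. An attacker who cannot
# read them cannot precompute a set of colliding keys.
_PREFIX = int.from_bytes(os.urandom(8), "little")
_SUFFIX = int.from_bytes(os.urandom(8), "little")

def randomized_hash(s, prefix=_PREFIX, suffix=_SUFFIX):
    """Multiply-xor string hash seeded with a random prefix and suffix."""
    if not s:
        return 0
    x = prefix ^ (ord(s[0]) << 7)
    for ch in s:
        x = (1000003 * x) ^ ord(ch)
        x &= (1 << 64) - 1          # emulate 64-bit C arithmetic
    x ^= len(s)
    x ^= suffix
    return x
```

With the same secrets the hash is deterministic; with different secrets the
same string hashes differently, which is the property that defeats
precomputed collision sets.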

--

I consider Glenn's proposal impractical, because every application and
every library would have to be patched to use the new "safe" dict type.

Some people are concerned about possible regressions introduced by Marc's
proposal: his patch may raise an exception for legitimate data.

My proposal tries to be "just secure enough" with low runtime overhead.
My patch is large, however (which makes backporting more complex), whereas
Marc's patch is very simple and therefore trivial to backport.

--


Guido van Rossum | 13 Jan 03:57 2012

Re: Status of the fix for the hash collision vulnerability

Hm... I started out as a big fan of the randomized hash, but thinking more about it, I actually believe that the chances of some legitimate app having >1000 collisions are way smaller than the chances that somebody's code will break due to the variable hashing. In fact we know for a fact that the latter will break code, since it changes the order of items in a dict. This affects many tests written without this in mind, and I assume there will be some poor sap out there who uses Python's hash() function to address some external persistent hash table or some other external data structure. How pathological does the data need to be before the collision counter triggers? I'd expect *very* pathological.

This depends on how the counting is done (I didn't look at MAL's patch), and assumes that increasing the hash table size will generally reduce collisions if items collide but their hashes are different.

That said, even with collision counting I'd like a way to disable it without changing the code, e.g. a flag or environment variable.

--Guido

On Thu, Jan 12, 2012 at 5:24 PM, Victor Stinner <victor.stinner <at> haypocalc.com> wrote:
Many people proposed their own ideas for fixing the vulnerability, but only
three wrote a patch:

- Glenn Linderman proposes to fix the vulnerability by adding a new
"safe" dict type (only accepting string keys). His proof-of-concept
(SafeDict.py) uses a secret of 64 random bits and uses it to compute
the hash of a key.
- Marc Andre Lemburg proposes to fix the vulnerability directly in
dict (for any key type). The patch raises an exception if a lookup
causes more than 1000 collisions.
- I propose to fix the vulnerability only in the Unicode hash (not for
other types). My patch adds a random secret initialized at startup (it
can be disabled or fixed using an environment variable).

--

I consider Glenn's proposal impractical, because every application and
every library would have to be patched to use the new "safe" dict type.

Some people are concerned about possible regressions introduced by Marc's
proposal: his patch may raise an exception for legitimate data.

My proposal tries to be "just secure enough" with low runtime overhead.
My patch is large, however (which makes backporting more complex), whereas
Marc's patch is very simple and therefore trivial to backport.

--

It is still unclear to me whether the fix should be enabled by default for
Python < 3.3. Because the overhead of my patch is low, I would prefer to
enable the fix by default, to protect everyone with a simple Python
upgrade.

I would rather explain to people having trouble with the randomized hash
how to disable it explicitly (PYTHONHASHSEED=0), or how to fix their
application's bugs, than leave the hole open by default.

--

We might change hash() for types other than str, but it looks like web
servers are only affected through dicts with string keys.

We may use Paul's hash function if mine is not secure enough.

My patch doesn't fix the DoS outright; it just makes the attack more
complex. The attacker cannot pregenerate data for an attack: (s)he first
has to recover the hash secret, and then compute hash collisions using
that secret. The hash secret is at least 64 bits long (128 bits on a
64-bit system), so I hope that computing collisions requires enough CPU
time to make the attack ineffective on today's computers.

--

I plan to write a clean patch for Python 3.3, then write a simpler patch
for 3.1 and 3.2 (duplicate the os.urandom code to keep it unchanged,
maybe don't create a new random.c file, maybe don't touch the test suite
even though the patch breaks many tests), and finally write patches for
Python 2.6 and 2.7.

Details about my patch:

- I tested it on Linux (32 and 64 bits) and Windows (Windows 7, 64-bit)
- a new PYTHONSEED environment variable allows controlling the randomized
hash: PYTHONSEED=0 completely disables the randomized hash (restoring the
previous behaviour), while PYTHONSEED=value uses a fixed seed for
processes sharing data and needing the same hash values (multiprocessing
users?)
- no overhead on hash(str)
- no startup overhead on Linux
- startup overhead is 10% on Windows (see the issue; I propose another
solution with a startup overhead of 1%)

The patch is not done; some tests are still failing because of the
randomized hash.
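
For reference, the environment variable that eventually shipped is spelled
PYTHONHASHSEED (both spellings appear in this thread). Assuming a Python 3
interpreter with hash randomization support, the behaviour Victor describes
can be observed by comparing hashes across child interpreters:

```python
import os
import subprocess
import sys

def hash_of(text, seed):
    """Run a child interpreter and report hash(text) under PYTHONHASHSEED=seed."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    out = subprocess.run(
        [sys.executable, "-c", f"print(hash({text!r}))"],
        env=env, capture_output=True, text=True, check=True,
    )
    return int(out.stdout)

# PYTHONHASHSEED=0 disables randomization: every run agrees.
a = hash_of("spam ham", 0)
b = hash_of("spam ham", 0)
# A fixed nonzero seed is also reproducible across runs, but keys the
# hash differently, so values change relative to seed 0.
c = hash_of("spam ham", 1)
d = hash_of("spam ham", 1)
```

This is exactly the trade-off discussed below: a fixed seed restores
reproducibility for cooperating processes, at the cost of predictability.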

--

FYI, PHP released version 5.3.9, adding a "max_input_vars directive to
prevent attacks based on hash collisions (CVE-2011-4885)".

Victor
_______________________________________________
Python-Dev mailing list
Python-Dev <at> python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/guido%40python.org



--
--Guido van Rossum (python.org/~guido)
Mark Dickinson | 13 Jan 18:08 2012

Re: Status of the fix for the hash collision vulnerability

On Fri, Jan 13, 2012 at 2:57 AM, Guido van Rossum <guido <at> python.org> wrote:
> How
> pathological the data needs to be before the collision counter triggers? I'd
> expect *very* pathological.

How pathological do you consider the set

   {1 << n for n in range(2000)}

to be?  What about the set:

   ieee754_powers_of_two = {2.0**n for n in range(-1074, 1024)}

?  The > 2000 elements of the latter set have only 61 distinct hash
values on a 64-bit machine, so there will be over 2000 total collisions
involved in creating this set (though admittedly only around 30
collisions per hash value).
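
Mark's numbers follow from how CPython 3 hashes integers: the hash of a
non-negative int is its value reduced modulo sys.hash_info.modulus (the
Mersenne prime 2**61 - 1 on 64-bit builds), and since 2**61 ≡ 1 modulo that
prime, hash(1 << n) cycles with period 61. A quick check:

```python
import sys

# CPython 3 reduces non-negative int hashes modulo a Mersenne prime,
# exposed as sys.hash_info.modulus (2**61 - 1 on 64-bit builds).
M = sys.hash_info.modulus

# hash(1 << n) is therefore pow(2, n, M) ...
assert all(hash(1 << n) == pow(2, n, M) for n in range(500))

# ... and the multiplicative order of 2 modulo 2**61 - 1 is 61,
# hence Mark's count of distinct hash values on a 64-bit build:
if M == 2**61 - 1:
    print(len({hash(1 << n) for n in range(2000)}))  # 61 distinct values
```

The float set behaves the same way because CPython 3 hashes 2.0**n to the
matching power of two modulo the same prime (using the modular inverse of 2
for negative exponents).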

--

-- 
Mark
Guido van Rossum | 13 Jan 18:43 2012

Re: Status of the fix for the hash collision vulnerability

On Fri, Jan 13, 2012 at 9:08 AM, Mark Dickinson <dickinsm <at> gmail.com> wrote:
On Fri, Jan 13, 2012 at 2:57 AM, Guido van Rossum <guido <at> python.org> wrote:
> How
> pathological the data needs to be before the collision counter triggers? I'd
> expect *very* pathological.

How pathological do you consider the set

  {1 << n for n in range(2000)}

to be?  What about the set:

  ieee754_powers_of_two = {2.0**n for n in range(-1074, 1024)}

?  The > 2000 elements of the latter set have only 61 distinct hash
values on 64-bit machine, so there will be over 2000 total collisions
involved in creating this set (though admittedly only around 30
collisions per hash value).

Hm... So how does the collision counting work for this case?

--
--Guido van Rossum (python.org/~guido)
Mark Dickinson | 13 Jan 19:13 2012

Re: Status of the fix for the hash collision vulnerability

On Fri, Jan 13, 2012 at 5:43 PM, Guido van Rossum <guido <at> python.org> wrote:
>> How pathological do you consider the set
>>
>>   {1 << n for n in range(2000)}
>>
>> to be?  What about the set:
>>
>>   ieee754_powers_of_two = {2.0**n for n in range(-1074, 1024)}
>>
>> ?  The > 2000 elements of the latter set have only 61 distinct hash
>> values on 64-bit machine, so there will be over 2000 total collisions
>> involved in creating this set (though admittedly only around 30
>> collisions per hash value).
>
> Hm... So how does the collision counting work for this case?

Ah, my bad.  It looks like the ieee754_powers_of_two is safe---IIUC,
it's the number of collisions involved in a single key-set operation
that's limited.  So a dictionary with keys {1<<n for n in range(2000)}
is fine, but a dictionary with keys  {1<<(61*n) for n in range(2000)}
is not:

>>> {1<<(n*61):True for n in range(2000)}
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 1, in <dictcomp>
KeyError: 'too many hash collisions'
[67961 refs]

I'd still not consider this particularly pathological, though.
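
The limit-per-operation behaviour can be modelled with a toy open-addressing
table. This is a sketch of the counting idea only; MAL's actual patch works
inside CPython's dict implementation and differs in detail:

```python
class TooManyCollisions(Exception):
    pass

class CountingTable:
    """Toy open-addressing table with a per-insert collision limit.

    A sketch of the counting idea, not MAL's actual patch.
    """
    LIMIT = 1000

    def __init__(self):
        self.size = 8
        self.used = 0
        self.slots = [None] * self.size  # each entry: (hash, key, value)

    def _insert(self, h, key, value):
        mask = self.size - 1
        perturb = h & ((1 << 64) - 1)  # treat the hash as unsigned, as C does
        i = perturb & mask
        collisions = 0
        while self.slots[i] is not None:
            slot_h, slot_key, _ = self.slots[i]
            if slot_h == h and slot_key == key:
                self.slots[i] = (h, key, value)  # overwrite existing key
                return False
            collisions += 1
            if collisions > self.LIMIT:
                raise TooManyCollisions(f"{key!r}: {collisions} collisions")
            i = (5 * i + perturb + 1) & mask  # CPython-style probing
            perturb >>= 5
        self.slots[i] = (h, key, value)
        return True

    def __setitem__(self, key, value):
        if self._insert(hash(key), key, value):
            self.used += 1
        if 3 * self.used >= 2 * self.size:  # keep the load factor below 2/3
            entries = [s for s in self.slots if s is not None]
            self.size *= 2
            self.slots = [None] * self.size
            for h, k, v in entries:
                self._insert(h, k, v)
```

Thousands of ordinary string keys insert without incident, while a few
thousand keys sharing one hash value trip the limit around the 1001st
insert, since only collisions within a single insert are counted.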

--

-- 
Mark
Guido van Rossum | 13 Jan 22:22 2012

Re: Status of the fix for the hash collision vulnerability

On Fri, Jan 13, 2012 at 10:13 AM, Mark Dickinson <dickinsm <at> gmail.com> wrote:
On Fri, Jan 13, 2012 at 5:43 PM, Guido van Rossum <guido <at> python.org> wrote:
>> How pathological do you consider the set
>>
>>   {1 << n for n in range(2000)}
>>
>> to be?  What about the set:
>>
>>   ieee754_powers_of_two = {2.0**n for n in range(-1074, 1024)}
>>
>> ?  The > 2000 elements of the latter set have only 61 distinct hash
>> values on 64-bit machine, so there will be over 2000 total collisions
>> involved in creating this set (though admittedly only around 30
>> collisions per hash value).
>
> Hm... So how does the collision counting work for this case?

Ah, my bad.  It looks like the ieee754_powers_of_two is safe---IIUC,
it's the number of collisions involved in a single key-set operation
that's limited.  So a dictionary with keys {1<<n for n in range(2000)}
is fine, but a dictionary with keys  {1<<(61*n) for n in range(2000)}
is not:

>>> {1<<(n*61):True for n in range(2000)}
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
 File "<stdin>", line 1, in <dictcomp>
KeyError: 'too many hash collisions'
[67961 refs]

I'd still not consider this particularly pathological, though.

Really? Even though you came up with it specifically to prove me wrong?

--
--Guido van Rossum (python.org/~guido)
Benjamin Peterson | 14 Jan 01:37 2012

Re: Status of the fix for the hash collision vulnerability

2012/1/13 Guido van Rossum <guido <at> python.org>:
> Really? Even though you came up with it specifically to prove me wrong?

Coming up with a counterexample now invalidates it?

--

-- 
Regards,
Benjamin
"Martin v. Löwis" | 14 Jan 16:17 2012

Re: Status of the fix for the hash collision vulnerability

Am 14.01.2012 01:37, schrieb Benjamin Peterson:
> 2012/1/13 Guido van Rossum <guido <at> python.org>:
>> Really? Even though you came up with it specifically to prove me wrong?
> 
> Coming up with a counterexample now invalidates it?

There are two concerns here:
- is it possible to come up with an example of constructed values that
  show many collisions in a way that poses a threat? To this, the answer
  is apparently "yes", and the proposed reaction is to hard-limit the
  number of collisions accepted by the implementation.
- then, *assuming* such a limitation is in place: is it possible to come
  up with a realistic application that would break under this
  limitation? Mark's example is no such realistic application; instead,
  it is yet another example demonstrating collisions using constructed
  values (although this specific example would continue to work fine
  even under the limitation).

A valid counterexample would have to come from a real application, or
at least from a scenario that is plausible for a real application.

Regards,
Martin
"Martin v. Löwis" | 14 Jan 16:12 2012

Re: Status of the fix for the hash collision vulnerability

Am 13.01.2012 18:08, schrieb Mark Dickinson:
> On Fri, Jan 13, 2012 at 2:57 AM, Guido van Rossum <guido <at> python.org> wrote:
>> How
>> pathological the data needs to be before the collision counter triggers? I'd
>> expect *very* pathological.
> 
> How pathological do you consider the set
> 
>    {1 << n for n in range(2000)}
> 
> to be?  

I think this is not a counter-example for the proposed algorithm (at
least not in the way I think it should be implemented).

Those values may collide on the slot in the set, but they don't collide
on the actual hash value.

So in order to determine whether the collision limit is exceeded, we
shouldn't count colliding slots but colliding hash values (all of which
we will encounter during an insert).

> though admittedly only around 30 collisions per hash value.

I do consider the case of hashing integers with only one bit set
pathological. However, this can be overcome by factoring the magnitude
of the number into the hash as well.
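
Martin's closing idea can be illustrated with a hypothetical variant of the
int hash that folds the bit length in. This is purely a sketch (no such
change was adopted, and the odd multiplier 1000003 is an arbitrary choice):

```python
import sys

M = sys.hash_info.modulus  # the int-hash modulus; 2**61 - 1 on 64-bit CPython

def magnitude_mixed_hash(n):
    """Hypothetical int hash mixing in the magnitude, per Martin's idea."""
    return (hash(n) * 1000003 + n.bit_length()) % M

# Plain hashes of one-bit integers collapse onto few values (61 on a
# 64-bit build); mixing in the bit length separates integers of
# different sizes even when their plain hashes coincide.
plain = len({hash(1 << n) for n in range(2000)})
mixed = len({magnitude_mixed_hash(1 << n) for n in range(2000)})
```

The key property is pairwise: two one-bit integers with equal plain hashes
always differ in bit length (by less than the modulus), so the mixed hash
distinguishes them.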

Regards,
Martin
Antoine Pitrou | 14 Jan 02:17 2012

Re: Status of the fix for the hash collision vulnerability

On Thu, 12 Jan 2012 18:57:42 -0800
Guido van Rossum <guido <at> python.org> wrote:
> Hm... I started out as a big fan of the randomized hash, but thinking more
> about it, I actually believe that the chances of some legitimate app having
> >1000 collisions are way smaller than the chances that somebody's code will
> break due to the variable hashing.

Breaking due to variable hashing is deterministic: you notice it as
soon as you upgrade (and then you use PYTHONHASHSEED to disable
variable hashing). That seems better than unpredictable breaking when
some legitimate collision chain happens.

Regards

Antoine.

Guido van Rossum | 14 Jan 02:38 2012

Re: Status of the fix for the hash collision vulnerability

On Fri, Jan 13, 2012 at 5:17 PM, Antoine Pitrou <solipsis <at> pitrou.net> wrote:
On Thu, 12 Jan 2012 18:57:42 -0800
Guido van Rossum <guido <at> python.org> wrote:
> Hm... I started out as a big fan of the randomized hash, but thinking more
> about it, I actually believe that the chances of some legitimate app having
> >1000 collisions are way smaller than the chances that somebody's code will
> break due to the variable hashing.

Breaking due to variable hashing is deterministic: you notice it as
soon as you upgrade (and then you use PYTHONHASHSEED to disable
variable hashing). That seems better than unpredictable breaking when
some legitimate collision chain happens.

Fair enough. But I'm now uncomfortable with turning this on for bugfix releases. I'm fine with making this the default in 3.3, just not in 3.2, 3.1 or 2.x -- it will break too much code and organizations will have to roll back the release or do extensive testing before installing a bugfix release -- exactly what we *don't* want for those.

FWIW, I don't believe in the SafeDict solution -- you never know which dicts you have to change.

--
--Guido van Rossum (python.org/~guido)
Gregory P. Smith | 14 Jan 02:58 2012

Re: Status of the fix for the hash collision vulnerability


On Fri, Jan 13, 2012 at 5:38 PM, Guido van Rossum <guido <at> python.org> wrote:
On Fri, Jan 13, 2012 at 5:17 PM, Antoine Pitrou <solipsis <at> pitrou.net> wrote:
On Thu, 12 Jan 2012 18:57:42 -0800
Guido van Rossum <guido <at> python.org> wrote:
> Hm... I started out as a big fan of the randomized hash, but thinking more
> about it, I actually believe that the chances of some legitimate app having
> >1000 collisions are way smaller than the chances that somebody's code will
> break due to the variable hashing.

Breaking due to variable hashing is deterministic: you notice it as
soon as you upgrade (and then you use PYTHONHASHSEED to disable
variable hashing). That seems better than unpredictable breaking when
some legitimate collision chain happens.

Fair enough. But I'm now uncomfortable with turning this on for bugfix releases. I'm fine with making this the default in 3.3, just not in 3.2, 3.1 or 2.x -- it will break too much code and organizations will have to roll back the release or do extensive testing before installing a bugfix release -- exactly what we *don't* want for those.

FWIW, I don't believe in the SafeDict solution -- you never know which dicts you have to change.


Agreed.

Of the three options Victor listed only one is good.

I don't like SafeDict.  -1.  It puts the onus on the coder to always get everything right, ensuring that data that came from outside the process never ends up hashed in a non-safe dict or set *anywhere*.  "Safe" needs to be the default option for all hash tables.

I don't like the "too many hash collisions" exception. -1. It provides non-deterministic application behavior for data driven applications with no way for them to predict when it'll happen or where and prepare for it. It may work in practice for many applications but is simply odd behavior.

I do like randomly seeding the hash. +1. This is easy. It can easily be backported to any Python version.

It is perfectly okay to break existing users who had anything depending on ordering of internal hash tables. Their code was already broken. We will provide a flag and/or environment variable that can be set to turn the feature off at their own peril which they can use in their test harnesses that are stupid enough to use doctests with order dependencies.

This approach worked fine for Perl 9 years ago.  https://rt.perl.org/rt3//Public/Bug/Display.html?id=22371

-gps
Steven D'Aprano | 14 Jan 03:55 2012

Re: Status of the fix for the hash collision vulnerability

On 14/01/12 12:58, Gregory P. Smith wrote:

> I do like *randomly seeding the hash*. *+1*. This is easy. It can easily be
> back ported to any Python version.
>
> It is perfectly okay to break existing users who had anything depending on
> ordering of internal hash tables. Their code was already broken.

For the record:

steve <at> runes:~$ python -c "print(hash('spam ham'))"
-376510515
steve <at> runes:~$ jython -c "print(hash('spam ham'))"
2054637885

So it is already the case that Python code that assumes stable hashing is broken.

For what it's worth, I'm not convinced that we should be overly concerned by 
"poor saps" (Guido's words) who rely on accidents of implementation regarding 
hash. We shouldn't break their code unless we have a good reason, but this 
strikes me as a good reason. The documentation for hash certainly makes no 
promise about stability, and relying on it strikes me as about as sensible as 
relying on the stability of error messages.

I'm also not convinced that the option to raise an exception after 1000 
collisions actually solves the problem. That relies on the application being 
re-written to catch the exception and recover from it (how?). Otherwise, all 
it does is change the attack vector from "cause an indefinite number of hash 
collisions" to "cause 999 hash collisions followed by crashing the application 
with an exception", which doesn't strike me as much of an improvement.
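For concreteness, the collision-counting idea being critiqued can be sketched like this (a toy model in pure Python, not Marc's actual patch, which modifies the C dict implementation; all names here are illustrative):

```python
class CollisionLimitError(Exception):
    """Raised when a single lookup exceeds the collision threshold."""

MAX_COLLISIONS = 1000  # the threshold from the proposal under discussion

class CountingDict:
    """Toy open-addressing hash table that raises after too many
    collisions on one lookup -- a sketch of the idea only."""

    def __init__(self, size=64):
        # size must be a power of two so that `& mask` works as modulo
        self._slots = [None] * size

    def _probe(self, key):
        mask = len(self._slots) - 1
        i = hash(key) & mask
        collisions = 0
        while True:
            entry = self._slots[i]
            if entry is None or entry[0] == key:
                return i
            collisions += 1
            if collisions > MAX_COLLISIONS:
                raise CollisionLimitError("too many hash collisions")
            i = (i + 1) & mask  # linear probing for simplicity

    def __setitem__(self, key, value):
        self._slots[self._probe(key)] = (key, value)

    def __getitem__(self, key):
        entry = self._slots[self._probe(key)]
        if entry is None:
            raise KeyError(key)
        return entry[1]
```

The objection above is visible in the sketch: an attacker who supplies 1001 colliding keys turns a slow lookup into an unhandled exception, unless every caller is rewritten to catch it.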

+1 on random seeding. Default to on in 3.3+ and default to off in older 
versions, which allows people to avoid breaking their code until they're ready 
for it to be broken.

-- 
Steven
Antoine Pitrou | 14 Jan 09:33 2012

Re: Status of the fix for the hash collision vulnerability

On Sat, 14 Jan 2012 13:55:22 +1100
Steven D'Aprano <steve <at> pearwood.info> wrote:
> On 14/01/12 12:58, Gregory P. Smith wrote:
> 
> > I do like *randomly seeding the hash*. *+1*. This is easy. It can easily be
> > back ported to any Python version.
> >
> > It is perfectly okay to break existing users who had anything depending on
> > ordering of internal hash tables. Their code was already broken.
> 
> For the record:
> 
> steve <at> runes:~$ python -c "print(hash('spam ham'))"
> -376510515
> steve <at> runes:~$ jython -c "print(hash('spam ham'))"
> 2054637885

Not to mention:

$ ./python -c "print(hash('spam ham'))"
-6071355389066156083

(64-bit CPython)

Regards

Antoine.

Gregory P. Smith | 14 Jan 04:06 2012

Re: Status of the fix for the hash collision vulnerability



On Fri, Jan 13, 2012 at 5:58 PM, Gregory P. Smith <greg <at> krypto.org> wrote:

On Fri, Jan 13, 2012 at 5:38 PM, Guido van Rossum <guido <at> python.org> wrote:
On Fri, Jan 13, 2012 at 5:17 PM, Antoine Pitrou <solipsis <at> pitrou.net> wrote:
On Thu, 12 Jan 2012 18:57:42 -0800
Guido van Rossum <guido <at> python.org> wrote:
> Hm... I started out as a big fan of the randomized hash, but thinking more
> about it, I actually believe that the chances of some legitimate app having
> >1000 collisions are way smaller than the chances that somebody's code will
> break due to the variable hashing.

Breaking due to variable hashing is deterministic: you notice it as
soon as you upgrade (and then you use PYTHONHASHSEED to disable
variable hashing). That seems better than unpredictable breaking when
some legitimate collision chain happens.

Fair enough. But I'm now uncomfortable with turning this on for bugfix releases. I'm fine with making this the default in 3.3, just not in 3.2, 3.1 or 2.x -- it will break too much code and organizations will have to roll back the release or do extensive testing before installing a bugfix release -- exactly what we *don't* want for those.

FWIW, I don't believe in the SafeDict solution -- you never know which dicts you have to change.


Agreed.

Of the three options Victor listed only one is good.

I don't like SafeDict.  -1.  It puts the onus on the coder to always get everything right with regard to data that came from outside the process never ending up hashed in a non-safe dict or set *anywhere*.  "Safe" needs to be the default option for all hash tables.

I don't like the "too many hash collisions" exception. -1. It provides non-deterministic application behavior for data-driven applications, with no way for them to predict when or where it will happen, or to prepare for it. It may work in practice for many applications but is simply odd behavior.

I do like randomly seeding the hash. +1. This is easy. It can easily be back ported to any Python version.

It is perfectly okay to break existing users who had anything depending on ordering of internal hash tables. Their code was already broken. We will provide a flag and/or environment variable that can be set to turn the feature off at their own peril which they can use in their test harnesses that are stupid enough to use doctests with order dependencies.
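The kind of already-broken code being referred to looks roughly like this (a contrived sketch):

```python
d = {"spam": 1, "ham": 2, "eggs": 3}

# Fragile: assumes a particular hash-table iteration order, which was
# never guaranteed and changes under randomized hashing.
#     assert list(d) == ["spam", "ham", "eggs"]

# Robust: compare contents rather than order.
assert set(d) == {"spam", "ham", "eggs"}
assert sorted(d) == ["eggs", "ham", "spam"]
```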

What an implementation looks like:

 http://pastebin.com/9ydETTag

Some stuff remains to be filled in, but this is all that is really required.  Add logic to allow a particular seed to be specified or forced to 0 from the command line or environment.  Add the logic to grab random bytes.  Add the autoconf glue to disable it.  Done.
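In outline, the proposal amounts to mixing a process-wide random secret into the string hash. A minimal Python sketch of the idea (the environment variable name and the exact mixing scheme here are illustrative, not the patch itself):

```python
import os
import random

# Derive one process-wide secret at startup.  A fixed value from the
# environment gives reproducible hashing; otherwise use random bits.
_env = os.environ.get("PYTHONHASHSEED")  # name as proposed in the thread
if _env not in (None, "", "random"):
    _SECRET = int(_env)
else:
    _SECRET = random.SystemRandom().getrandbits(64)

def randomized_str_hash(s):
    """DJBX33A-style string hash perturbed by the process secret."""
    h = _SECRET ^ 5381
    for ch in s:
        h = ((h * 33) ^ ord(ch)) & 0xFFFFFFFFFFFFFFFF
    return h
```

Within one process the hash stays stable, so dicts keep working; across processes the secret (and therefore the iteration order of string-keyed dicts) differs.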

-gps


This approach worked fine for Perl 9 years ago.  https://rt.perl.org/rt3//Public/Bug/Display.html?id=22371

-gps

martin | 14 Jan 04:45 2012

Re: Status of the fix for the hash collision vulnerability

> What an implementation looks like:
>
>  http://pastebin.com/9ydETTag
>
> some stuff to be filled in, but this is all that is really required.

I think this statement (and the patch) is wrong. You also need to change
the byte string hashing, at least for 2.x. This I consider the biggest
flaw in that approach - other people may have written string-like objects
which continue to compare equal to a string but now hash different.

Regards,
Martin

Antoine Pitrou | 14 Jan 09:33 2012

Re: Status of the fix for the hash collision vulnerability

On Sat, 14 Jan 2012 04:45:57 +0100
martin <at> v.loewis.de wrote:
> > What an implementation looks like:
> >
> >  http://pastebin.com/9ydETTag
> >
> > some stuff to be filled in, but this is all that is really required.
> 
> I think this statement (and the patch) is wrong. You also need to change
> the byte string hashing, at least for 2.x. This I consider the biggest
> flaw in that approach - other people may have written string-like objects
> which continue to compare equal to a string but now hash different.

They're unlikely to have rewritten the hash algorithm by hand -
especially given the caveats wrt. differences between Python integers
and C integers.
Rather, they would have returned the hash() of the equivalent str or
unicode object.

Regards

Antoine.

martin | 14 Jan 13:09 2012

Re: Status of the fix for the hash collision vulnerability

>> I think this statement (and the patch) is wrong. You also need to change
>> the byte string hashing, at least for 2.x. This I consider the biggest
>> flaw in that approach - other people may have written string-like objects
>> which continue to compare equal to a string but now hash different.
>
> They're unlikely to have rewritten the hash algorithm by hand -
> especially given the caveats wrt. differences between Python integers
> and C integers.

See the CHAR_HASH macro in
http://hg.python.org/cpython/file/e78f00dbd7ae/Modules/expat/xmlparse.c

It's not *that* unlikely that more copies of that algorithm exist.
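For reference, the widely copied algorithm in question is the classic pre-randomization CPython string hash; roughly (a sketch, masked to 64 bits and ignoring signedness details):

```python
def old_cpython_str_hash(s):
    """The multiply-and-xor string hash CPython used before
    randomization.  Hand copies of this loop in extension code
    hash the same input to the same value on every run, which is
    exactly why they would need updating too."""
    if not s:
        return 0
    h = ord(s[0]) << 7
    for ch in s:
        h = ((1000003 * h) ^ ord(ch)) & 0xFFFFFFFFFFFFFFFF
    h ^= len(s)
    return h
```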

Regards,
Martin

Gregory P. Smith | 14 Jan 20:17 2012

Re: Status of the fix for the hash collision vulnerability

My patch example does change the bytes object hash as well as Unicode.

On Jan 13, 2012 7:46 PM, <martin <at> v.loewis.de> wrote:
What an implementation looks like:

 http://pastebin.com/9ydETTag

some stuff to be filled in, but this is all that is really required.

I think this statement (and the patch) is wrong. You also need to change
the byte string hashing, at least for 2.x. This I consider the biggest
flaw in that approach - other people may have written string-like objects
which continue to compare equal to a string but now hash different.

Regards,
Martin


_______________________________________________
Python-Dev mailing list
Python-Dev <at> python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/greg%40krypto.org
Gregory P. Smith | 15 Jan 02:31 2012

Re: Status of the fix for the hash collision vulnerability

FWIW the quick change I pastebin'ed is basically covered by the change already under review in http://bugs.python.org/review/13704/show.  I've made my comments and suggestions there.


I looked into Modules/expat/xmlparse.c and it has an odd copy of the old string hash algorithm entirely for its own internal use and its own internal hash table implementations.  That module is likely vulnerable to creatively crafted documents for the same reason.  With 13704 and the public API it provides to get the random hash seed, that module could simply be updated to use that in its own hash implementation.

As for when to enable it or not, I unfortunately have to agree, despite my wild desires we can't turn on the hash randomization change by default in anything prior to 3.3.

-gps

On Sat, Jan 14, 2012 at 11:17 AM, Gregory P. Smith <greg <at> krypto.org> wrote:

My patch example does change the bytes object hash as well as Unicode.

On Jan 13, 2012 7:46 PM, <martin <at> v.loewis.de> wrote:
What an implementation looks like:

 http://pastebin.com/9ydETTag

some stuff to be filled in, but this is all that is really required.

I think this statement (and the patch) is wrong. You also need to change
the byte string hashing, at least for 2.x. This I consider the biggest
flaw in that approach - other people may have written string-like objects
which continue to compare equal to a string but now hash different.

Regards,
Martin


_______________________________________________
Python-Dev mailing list
Python-Dev <at> python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/greg%40krypto.org

Guido van Rossum | 14 Jan 05:00 2012

Re: Status of the fix for the hash collision vulnerability

On Fri, Jan 13, 2012 at 5:58 PM, Gregory P. Smith <greg <at> krypto.org> wrote:
It is perfectly okay to break existing users who had anything depending on ordering of internal hash tables. Their code was already broken. We will provide a flag and/or environment variable that can be set to turn the feature off at their own peril which they can use in their test harnesses that are stupid enough to use doctests with order dependencies.

No, that is not how we usually take compatibility between bugfix releases. "Your code is already broken" is not an argument to break forcefully what worked (even if by happenstance) before. The difference between CPython and Jython (or between different CPython feature releases) also isn't relevant -- historically we have often bent over backwards to avoid changing behavior that was technically undefined, if we believed it would affect a significant fraction of users.

I don't think anyone doubts that this will break lots of code (at least, the arguments I've heard have been "their code is broken", not "nobody does that").

This approach worked fine for Perl 9 years ago.  https://rt.perl.org/rt3//Public/Bug/Display.html?id=22371

I don't know what the Perl attitude about breaking undefined behavior between micro versions was at the time. But ours is pretty clear -- don't do it.

--
--Guido van Rossum (python.org/~guido)
Steven D'Aprano | 15 Jan 05:49 2012

Re: Status of the fix for the hash collision vulnerability

Guido van Rossum wrote:
> On Fri, Jan 13, 2012 at 5:58 PM, Gregory P. Smith <greg <at> krypto.org> wrote:
> 
>> It is perfectly okay to break existing users who had anything depending on
>> ordering of internal hash tables. Their code was already broken. We *will*
>> provide a flag and/or environment variable that can be set to turn the
>> feature off at their own peril which they can use in their test harnesses
>> that are stupid enough to use doctests with order dependencies.
> 
> 
> No, that is not how we usually take compatibility between bugfix releases.
> "Your code is already broken" is not an argument to break forcefully what
> worked (even if by happenstance) before. The difference between CPython and
> Jython (or between different CPython feature releases) also isn't relevant
> -- historically we have often bent over backwards to avoid changing
> behavior that was technically undefined, if we believed it would affect a
> significant fraction of users.
> 
> I don't think anyone doubts that this will break lots of code (at least,
> the arguments I've heard have been "their code is broken", not "nobody does
> that").

I don't know about "lots" of code, but it will break at least one library (or 
so I'm told):

http://mail.python.org/pipermail/python-list/2012-January/1286535.html

-- 
Steven
Hynek Schlawack | 15 Jan 13:15 2012

Re: Status of the fix for the hash collision vulnerability

On Sunday, 15 January 2012 at 05:49, Steven D'Aprano wrote:
> > I don't think anyone doubts that this will break lots of code (at least,
> > the arguments I've heard have been "their code is broken", not "nobody does
> > that").
> 
> I don't know about "lots" of code, but it will break at least one library (or 
> so I'm told):
> 
> http://mail.python.org/pipermail/python-list/2012-January/1286535.html
Sadly, suds is also Python's _only_ usable SOAP library at this moment. :( (on top of that, the development
is in limbo ATM)

Victor Stinner | 15 Jan 15:27 2012

Re: Status of the fix for the hash collision vulnerability

I don't think that it would be hard to patch this library to use
another hash function. It can implement its own hash function, use
MD5, SHA1, or anything else. hash() is not stable across Python
versions and 32/64 bit systems.

Victor
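A stable replacement along the lines Victor describes might look like this (a sketch; the function name and the choice of MD5 are illustrative, not from any actual patch to suds):

```python
import hashlib
import struct

def stable_url_hash(url):
    """Cache key that is deterministic across runs, interpreters,
    and 32/64-bit builds, unlike the builtin hash()."""
    digest = hashlib.md5(url.encode("utf-8")).digest()
    return struct.unpack(">Q", digest[:8])[0]  # first 64 bits as an int
```

Because the value depends only on the URL bytes, an on-disk cache indexed by it survives interpreter restarts and upgrades.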

2012/1/15 Hynek Schlawack <hs <at> ox.cx>:
> On Sunday, 15 January 2012 at 05:49, Steven D'Aprano wrote:
>> > I don't think anyone doubts that this will break lots of code (at least,
>> > the arguments I've heard have been "their code is broken", not "nobody does
>> > that").
>>
>> I don't know about "lots" of code, but it will break at least one library (or
>> so I'm told):
>>
>> http://mail.python.org/pipermail/python-list/2012-January/1286535.html
> Sadly, suds is also Python's _only_ usable SOAP library at this moment. :( (on top of that, the development
> is in limbo ATM)
>
> _______________________________________________
> Python-Dev mailing list
> Python-Dev <at> python.org
> http://mail.python.org/mailman/listinfo/python-dev
> Unsubscribe: http://mail.python.org/mailman/options/python-dev/victor.stinner%40haypocalc.com
Heiko Wundram | 15 Jan 19:40 2012

Re: Status of the fix for the hash collision vulnerability

On 15.01.2012 15:27, Victor Stinner wrote:
> I don't think that it would be hard to patch this library to use
> another hash function. It can implement its own hash function, use
> MD5, SHA1, or anything else. hash() is not stable across Python
> versions and 32/64 bit systems.

As I wrote in a reply further down: no, it isn't hard to change this 
behaviour (and I find the current caching system, which uses hash() on 
an URL to choose the cache index, braindead to begin with), but, as with 
all other considerations: the current version of the library, with the 
default options, depends on hash() to be stable for the cache to make 
any sense at all (and especially with "generic" schema such as the 
referenced xml.dtd, caching makes a lot of sense, and not being able to 
cache _breaks_ applications as it did mine). This is just something to 
bear in mind.

-- 
--- Heiko.
Terry Reedy | 14 Jan 06:43 2012

Re: Status of the fix for the hash collision vulnerability

On 1/13/2012 8:58 PM, Gregory P. Smith wrote:

> It is perfectly okay to break existing users who had anything depending
> on ordering of internal hash tables. Their code was already broken.

Given that the doc says "Return the hash value of the object", I do not 
think we should be so hard-nosed. The above clearly implies that there 
is such a thing as *the* Python hash value for an object. And indeed, 
that has been true across many versions. If we had written "Return a 
hash value for the object, which can vary from run to run", the case 
would be different.

-- 
Terry Jan Reedy

Stefan Behnel | 15 Jan 15:30 2012

Re: Status of the fix for the hash collision vulnerability

Terry Reedy, 14.01.2012 06:43:
> On 1/13/2012 8:58 PM, Gregory P. Smith wrote:
> 
>> It is perfectly okay to break existing users who had anything depending
>> on ordering of internal hash tables. Their code was already broken.
> 
> Given that the doc says "Return the hash value of the object", I do not
> think we should be so hard-nosed. The above clearly implies that there is
> such a thing as *the* Python hash value for an object. And indeed, that has
> been true across many versions. If we had written "Return a hash value for
> the object, which can vary from run to run", the case would be different.

Just a side note, but I don't think hash() is the right place to document
this. Hashing is a protocol in Python, just like indexing or iteration.
Nothing keeps an object from changing its hash value due to modification,
and that would even be valid in the face of the usual dict lookup
invariants if changes are only applied while the object is not referenced
by any dict. So the guarantees do not depend on the function hash() and may
be even weaker than your above statement.

Stefan

Guido van Rossum | 15 Jan 17:10 2012

Re: Status of the fix for the hash collision vulnerability

On Sun, Jan 15, 2012 at 6:30 AM, Stefan Behnel <stefan_ml <at> behnel.de> wrote:
Terry Reedy, 14.01.2012 06:43:
> On 1/13/2012 8:58 PM, Gregory P. Smith wrote:
>
>> It is perfectly okay to break existing users who had anything depending
>> on ordering of internal hash tables. Their code was already broken.
>
> Given that the doc says "Return the hash value of the object", I do not
> think we should be so hard-nosed. The above clearly implies that there is
> such a thing as *the* Python hash value for an object. And indeed, that has
> been true across many versions. If we had written "Return a hash value for
> the object, which can vary from run to run", the case would be different.

Just a side note, but I don't think hash() is the right place to document
this.

You mean we shouldn't document that the hash() of a string will vary per run?
 
Hashing is a protocol in Python, just like indexing or iteration.
Nothing keeps an object from changing its hash value due to modification,

Eh? There's a huge body of cultural awareness that only immutable objects should define a hash, implying that the hash remains constant during the object's lifetime.
 
and that would even be valid in the face of the usual dict lookup
invariants if changes are only applied while the object is not referenced
by any dict.

And how would you know it isn't?
 
So the guarantees do not depend on the function hash() and may
be even weaker than your above statement.

There are no actual guarantees for hash(), but lots of rules for well-behaved hashes.

--
--Guido van Rossum (python.org/~guido)
Stefan Behnel | 15 Jan 17:46 2012

Re: Status of the fix for the hash collision vulnerability

Guido van Rossum, 15.01.2012 17:10:
> On Sun, Jan 15, 2012 at 6:30 AM, Stefan Behnel wrote:
>> Terry Reedy, 14.01.2012 06:43:
>>> On 1/13/2012 8:58 PM, Gregory P. Smith wrote:
>>>
>>>> It is perfectly okay to break existing users who had anything depending
>>>> on ordering of internal hash tables. Their code was already broken.
>>>
>>> Given that the doc says "Return the hash value of the object", I do not
>>> think we should be so hard-nosed. The above clearly implies that there is
>>> such a thing as *the* Python hash value for an object. And indeed, that
>> has
>>> been true across many versions. If we had written "Return a hash value
>> for
>>> the object, which can vary from run to run", the case would be different.
>>
>> Just a side note, but I don't think hash() is the right place to document
>> this.
> 
> You mean we shouldn't document that the hash() of a string will vary per
> run?

No, I mean that the hash() builtin function is not the right place to
document the behaviour of a string hash. That should go into the string
object documentation.

Although, arguably, it may be worth mentioning in the docs of hash() that,
in general, hash values of builtin types are bound to the lifetime of the
interpreter instance (or entire runtime?) and may change after restarts. I
think that's a reasonable restriction to document that prominently, even if
it will only apply to str for the time being.

>> Hashing is a protocol in Python, just like indexing or iteration.
>> Nothing keeps an object from changing its hash value due to modification,
> 
> Eh? There's a huge body of cultural awareness that only immutable objects
> should define a hash, implying that the hash remains constant during the
> object's lifetime.
> 
>> and that would even be valid in the face of the usual dict lookup
>> invariants if changes are only applied while the object is not referenced
>> by any dict.
> 
> And how would you know it isn't?

Well, if it's an object with a mutable hash then it's up to the application
defining that object to make sure it's used in a sensible way. Immutability
just makes your life easier. I can imagine that an object gets removed from
a dict (say, a cache), modified and then reinserted, and I think it's valid
to allow the modification to have an impact on the hash in this case, in
order to accommodate for any changes to equality comparisons due to the
modification.

That being said, it seems that the Python docs actually consider constant
hashes a requirement rather than a virtue.

http://docs.python.org/glossary.html#term-hashable

"""
An object is hashable if it has a hash value which never changes during its
lifetime (it needs a __hash__() method), and can be compared to other
objects (it needs an __eq__() or __cmp__() method). Hashable objects which
compare equal must have the same hash value.
"""

It also seems to me that the wording "has a hash value which never changes
during its lifetime" makes it pretty clear that the lifetime of the hash
value is not guaranteed to outlast the lifetime of the object (although
that's a rather muddy definition - memory lifetime? or pickle-unpickle as
well?).

However, this entry in the glossary only seems to have appeared with Py2.6,
likely as a result of the abc changes. So it won't help in defending a
change to the hash function.

>> So the guarantees do not depend on the function hash() and may
>> be even weaker than your above statement.
> 
> There are no actual guarantees for hash(), but lots of rules for
> well-behaved hashes.

Absolutely.

Stefan

Gregory P. Smith | 15 Jan 18:02 2012

Re: Status of the fix for the hash collision vulnerability


On Sun, Jan 15, 2012 at 8:46 AM, Stefan Behnel <stefan_ml <at> behnel.de> wrote:
It also seems to me that the wording "has a hash value which never changes
during its lifetime" makes it pretty clear that the lifetime of the hash
value is not guaranteed to supersede the lifetime of the object (although
that's a rather muddy definition - memory lifetime? or pickle-unpickle as
well?).

Lifetime to me means of that specific instance of the object. I would not expect that to survive pickle-unpickle.


However, this entry in the glossary only seems to have appeared with Py2.6,
likely as a result of the abc changes. So it won't help in defending a
change to the hash function.

Ugh, I really hope there is no code out there depending on the hash function being the same across a pickle and unpickle boundary.  Unfortunately the hash function was last changed in 1996 in http://hg.python.org/cpython/rev/839f72610ae1 so it is possible someone somewhere has written code blindly assuming that non-guarantee is true.

-gps

Antoine Pitrou | 15 Jan 18:11 2012

Re: Status of the fix for the hash collision vulnerability

On Sun, 15 Jan 2012 17:46:36 +0100
Stefan Behnel <stefan_ml <at> behnel.de> wrote:
> Guido van Rossum, 15.01.2012 17:10:
> > On Sun, Jan 15, 2012 at 6:30 AM, Stefan Behnel wrote:
> >> Terry Reedy, 14.01.2012 06:43:
> >>> On 1/13/2012 8:58 PM, Gregory P. Smith wrote:
> >>>
> >>>> It is perfectly okay to break existing users who had anything depending
> >>>> on ordering of internal hash tables. Their code was already broken.
> >>>
> >>> Given that the doc says "Return the hash value of the object", I do not
> >>> think we should be so hard-nosed. The above clearly implies that there is
> >>> such a thing as *the* Python hash value for an object. And indeed, that
> >> has
> >>> been true across many versions. If we had written "Return a hash value
> >> for
> >>> the object, which can vary from run to run", the case would be different.
> >>
> >> Just a side note, but I don't think hash() is the right place to document
> >> this.
> > 
> > You mean we shouldn't document that the hash() of a string will vary per
> > run?
> 
> No, I mean that the hash() builtin function is not the right place to
> document the behaviour of a string hash. That should go into the string
> object documentation.

No, but we can document that *any* hash() value can vary between runs
without being specific about which builtin types randomize their
hashes right now.

Regards

Antoine.

Guido van Rossum | 15 Jan 18:44 2012

Re: Status of the fix for the hash collision vulnerability

On Sun, Jan 15, 2012 at 8:46 AM, Stefan Behnel <stefan_ml <at> behnel.de> wrote:
Guido van Rossum, 15.01.2012 17:10:
> On Sun, Jan 15, 2012 at 6:30 AM, Stefan Behnel wrote:
>> Terry Reedy, 14.01.2012 06:43:
>>> On 1/13/2012 8:58 PM, Gregory P. Smith wrote:
>>>
>>>> It is perfectly okay to break existing users who had anything depending
>>>> on ordering of internal hash tables. Their code was already broken.
>>>
>>> Given that the doc says "Return the hash value of the object", I do not
>>> think we should be so hard-nosed. The above clearly implies that there is
>>> such a thing as *the* Python hash value for an object. And indeed, that
>> has
>>> been true across many versions. If we had written "Return a hash value
>> for
>>> the object, which can vary from run to run", the case would be different.
>>
>> Just a side note, but I don't think hash() is the right place to document
>> this.
>
> You mean we shouldn't document that the hash() of a string will vary per
> run?

No, I mean that the hash() builtin function is not the right place to
document the behaviour of a string hash. That should go into the string
object documentation.

Although, arguably, it may be worth mentioning in the docs of hash() that,
in general, hash values of builtin types are bound to the lifetime of the
interpreter instance (or entire runtime?) and may change after restarts. I
think that's a reasonable restriction to document that prominently, even if
it will only apply to str for the time being.

Actually it will apply to a lot more than str, because the hash of (immutable) compound objects is often derived from the hash of the constituents, e.g. hash of a tuple.
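To make that propagation concrete, here is a toy model (the functions and names below are illustrative stand-ins invented for this sketch, not CPython's actual hash implementations):

```python
MASK = 0xFFFFFFFF  # pretend 32-bit hashes
MULT = 1000003

def str_hash(s, secret):
    # Toy randomized string hash: seed the state with a per-run secret.
    # Non-empty strings only, for brevity.
    x = secret ^ (ord(s[0]) << 7)
    for ch in s:
        x = ((MULT * x) ^ ord(ch)) & MASK
    return x ^ len(s)

def tuple_hash(t, secret):
    # Models how a tuple's hash is derived from its items' hashes:
    # mix each item's hash into a running state.
    x = 0x345678
    for item in t:
        h = str_hash(item, secret) if isinstance(item, str) else hash(item)
        x = ((x ^ h) * MULT) & MASK
    return x

# The same tuple hashes differently in two "runs" with different secrets:
assert tuple_hash(("spam", 1), secret=1) != tuple_hash(("spam", 1), secret=2)
```

So randomizing only the str hash automatically randomizes the hash of any tuple (or other compound immutable) that contains a string anywhere inside it.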
 
>> Hashing is a protocol in Python, just like indexing or iteration.
>> Nothing keeps an object from changing its hash value due to modification,
>
> Eh? There's a huge body of cultural awareness that only immutable objects
> should define a hash, implying that the hash remains constant during the
> object's lifetime.
>
>> and that would even be valid in the face of the usual dict lookup
>> invariants if changes are only applied while the object is not referenced
>> by any dict.
>
> And how would you know it isn't?

Well, if it's an object with a mutable hash then it's up to the application
defining that object to make sure it's used in a sensible way. Immutability
just makes your life easier. I can imagine that an object gets removed from
a dict (say, a cache), modified and then reinserted, and I think it's valid
to allow the modification to have an impact on the hash in this case, in
order to accommodate for any changes to equality comparisons due to the
modification.

That could be considered valid only in a very abstract, theoretical, non-constructive way, since there is no protocol to detect removal from a dict (and you cannot assume an object is used in only one dict at a time).
 
That being said, it seems that the Python docs actually consider constant
hashes a requirement rather than a virtue.

http://docs.python.org/glossary.html#term-hashable

"""
An object is hashable if it has a hash value which never changes during its
lifetime (it needs a __hash__() method), and can be compared to other
objects (it needs an __eq__() or __cmp__() method). Hashable objects which
compare equal must have the same hash value.
"""

It also seems to me that the wording "has a hash value which never changes
during its lifetime" makes it pretty clear that the lifetime of the hash
value is not guaranteed to outlast the lifetime of the object (although
that's a rather muddy definition - memory lifetime? or pickle-unpickle as
well?).

Across pickle-unpickle it's not considered the same object. Pickling at best preserves values.

However, this entry in the glossary only seems to have appeared with Py2.6,
likely as a result of the abc changes. So it won't help in defending a
change to the hash function.


>> So the guarantees do not depend on the function hash() and may
>> be even weaker than your above statement.
>
> There are no actual guarantees for hash(), but lots of rules for
> well-behaved hashes.

Absolutely.

--
--Guido van Rossum (python.org/~guido)
Gregory P. Smith | 16 Jan 21:16 2012

Re: Status of the fix for the hash collision vulnerability



On Sun, Jan 15, 2012 at 9:44 AM, Guido van Rossum <guido <at> python.org> wrote:
On Sun, Jan 15, 2012 at 8:46 AM, Stefan Behnel <stefan_ml <at> behnel.de> wrote:
Guido van Rossum, 15.01.2012 17:10:
> On Sun, Jan 15, 2012 at 6:30 AM, Stefan Behnel wrote:
>> Terry Reedy, 14.01.2012 06:43:
>>> On 1/13/2012 8:58 PM, Gregory P. Smith wrote:
>>>
>>>> It is perfectly okay to break existing users who had anything depending
>>>> on ordering of internal hash tables. Their code was already broken.
>>>
>>> Given that the doc says "Return the hash value of the object", I do not
>>> think we should be so hard-nosed. The above clearly implies that there is
>>> such a thing as *the* Python hash value for an object. And indeed, that
>> has
>>> been true across many versions. If we had written "Return a hash value
>> for
>>> the object, which can vary from run to run", the case would be different.
>>
>> Just a side note, but I don't think hash() is the right place to document
>> this.
>
> You mean we shouldn't document that the hash() of a string will vary per
> run?

No, I mean that the hash() builtin function is not the right place to
document the behaviour of a string hash. That should go into the string
object documentation.

Although, arguably, it may be worth mentioning in the docs of hash() that,
in general, hash values of builtin types are bound to the lifetime of the
interpreter instance (or entire runtime?) and may change after restarts. I
think that's a reasonable restriction to document that prominently, even if
it will only apply to str for the time being.

Actually it will apply to a lot more than str, because the hash of (immutable) compound objects is often derived from the hash of the constituents, e.g. hash of a tuple.
 
>> Hashing is a protocol in Python, just like indexing or iteration.
>> Nothing keeps an object from changing its hash value due to modification,
>
> Eh? There's a huge body of cultural awareness that only immutable objects
> should define a hash, implying that the hash remains constant during the
> object's lifetime.
>
>> and that would even be valid in the face of the usual dict lookup
>> invariants if changes are only applied while the object is not referenced
>> by any dict.
>
> And how would you know it isn't?

Well, if it's an object with a mutable hash then it's up to the application
defining that object to make sure it's used in a sensible way. Immutability
just makes your life easier. I can imagine that an object gets removed from
a dict (say, a cache), modified and then reinserted, and I think it's valid
to allow the modification to have an impact on the hash in this case, in
order to accommodate for any changes to equality comparisons due to the
modification.

That could be considered valid only in a very abstract, theoretical, non-constructive way, since there is no protocol to detect removal from a dict (and you cannot assume an object is used in only one dict at a time).
 
That being said, it seems that the Python docs actually consider constant
hashes a requirement rather than a virtue.

http://docs.python.org/glossary.html#term-hashable

"""
An object is hashable if it has a hash value which never changes during its
lifetime (it needs a __hash__() method), and can be compared to other
objects (it needs an __eq__() or __cmp__() method). Hashable objects which
compare equal must have the same hash value.
"""

It also seems to me that the wording "has a hash value which never changes
during its lifetime" makes it pretty clear that the lifetime of the hash
value is not guaranteed to outlast the lifetime of the object (although
that's a rather muddy definition - memory lifetime? or pickle-unpickle as
well?).

Across pickle-unpickle it's not considered the same object. Pickling at best preserves values.

Updating the docs to explicitly clarify this sounds like a good idea.  How does this wording, to be added to the glossary.rst hashing section, sound?

"""Hash values may not be stable across Python processes and must not be used for storage or otherwise communicated outside of a single Python session."""

-gps
Barry Warsaw | 14 Jan 04:19 2012

Re: Status of the fix for the hash collision vulnerability

On Jan 13, 2012, at 05:38 PM, Guido van Rossum wrote:

>On Fri, Jan 13, 2012 at 5:17 PM, Antoine Pitrou <solipsis <at> pitrou.net> wrote:
>
>> Breaking due to variable hashing is deterministic: you notice it as
>> soon as you upgrade (and then you use PYTHONHASHSEED to disable
>> variable hashing). That seems better than unpredictable breaking when
>> some legitimate collision chain happens.
>
>
>Fair enough. But I'm now uncomfortable with turning this on for bugfix
>releases. I'm fine with making this the default in 3.3, just not in 3.2,
>3.1 or 2.x -- it will break too much code and organizations will have to
>roll back the release or do extensive testing before installing a bugfix
>release -- exactly what we *don't* want for those.

+1

-Barry
Jack Diederich | 14 Jan 07:24 2012

Re: Status of the fix for the hash collision vulnerability

On Thu, Jan 12, 2012 at 9:57 PM, Guido van Rossum <guido <at> python.org> wrote:
> Hm... I started out as a big fan of the randomized hash, but thinking more
> about it, I actually believe that the chances of some legitimate app having
>>1000 collisions are way smaller than the chances that somebody's code will
> break due to the variable hashing.

Python's dicts are designed to avoid hash conflicts by resizing and
keeping the available slots bountiful.  1000 conflicts sounds like a
number that couldn't be hit accidentally unless you had a single dict
using a terabyte of RAM (i.e. if Titus Brown doesn't object, we're
good).  The probing also looks to exploit cache locality, but even so
it is very unlikely to produce one thousand conflicts by chance.  If
you get that many, there is an attack.

> This is depending on how the counting is done (I didn't look at MAL's
> patch), and assuming that increasing the hash table size will generally
> reduce collisions if items collide but their hashes are different.

The patch counts conflicts on an individual insert and not lifetime
conflicts.  Looks sane to me.
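As a sketch of what per-insert collision counting could look like (hypothetical code loosely modeled on the idea, not MAL's actual patch; it uses plain linear probing rather than CPython's perturbed probing):

```python
class ManyCollisionsError(Exception):
    """Raised when a single insert probes too many occupied slots."""

class ToyDict:
    # Toy open-addressing table with power-of-two sizes.
    def __init__(self, size=8, limit=1000):
        self._slots = [None] * size
        self._used = 0
        self._limit = limit

    def __setitem__(self, key, value):
        if self._used * 3 >= len(self._slots) * 2:  # keep load factor < 2/3
            self._resize()
        mask = len(self._slots) - 1
        i = hash(key) & mask
        collisions = 0
        while self._slots[i] is not None and self._slots[i][0] != key:
            collisions += 1  # counted per insert, not over the dict's lifetime
            if collisions > self._limit:
                raise ManyCollisionsError(key)
            i = (i + 1) & mask
        if self._slots[i] is None:
            self._used += 1
        self._slots[i] = (key, value)

    def __getitem__(self, key):
        mask = len(self._slots) - 1
        i = hash(key) & mask
        while self._slots[i] is not None:
            if self._slots[i][0] == key:
                return self._slots[i][1]
            i = (i + 1) & mask
        raise KeyError(key)

    def _resize(self):
        entries = [e for e in self._slots if e is not None]
        self._slots = [None] * (len(self._slots) * 4)
        self._used = 0
        for k, v in entries:
            self[k] = v

# A key type whose instances all collide trips the limit quickly:
class Evil:
    def __hash__(self):
        return 0
    def __eq__(self, other):
        return self is other

d = ToyDict(size=64, limit=10)
try:
    for i in range(50):
        d[Evil()] = i
except ManyCollisionsError:
    print("limit hit")
```

Normal data never comes close to the limit, while every adversarial key after the first dozen or so fails immediately, which is the hard worst-case bound being discussed.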

> That said, even with collision counting I'd like a way to disable it without
> changing the code, e.g. a flag or environment variable.

Agreed.  Paranoid people can turn the behavior off and if it ever were
to become a problem in practice we could point people to a solution.

-Jack
Nick Coghlan | 14 Jan 08:01 2012

Re: Status of the fix for the hash collision vulnerability

On Sat, Jan 14, 2012 at 4:24 PM, Jack Diederich <jackdied <at> gmail.com> wrote:
>> This is depending on how the counting is done (I didn't look at MAL's
>> patch), and assuming that increasing the hash table size will generally
>> reduce collisions if items collide but their hashes are different.
>
> The patch counts conflicts on an individual insert and not lifetime
> conflicts.  Looks sane to me.

Having a hard limit on the worst-case behaviour certainly sounds like
an attractive prospect. And there's nothing to worry about in terms of
secrecy or sufficient randomness - by default, attackers cannot
generate more than 1000 hash collisions in one lookup, period.

>> That said, even with collision counting I'd like a way to disable it without
>> changing the code, e.g. a flag or environment variable.
>
> Agreed.  Paranoid people can turn the behavior off and if it ever were
> to become a problem in practice we could point people to a solution.

Does MAL's patch allow the limit to be set on a per-dict basis
(including setting it to None to disable collision limiting
completely)? If people have data sets that need to tolerate that kind
of collision level (and haven't already decided to move to a data
structure other than the builtin dict), then it may make sense to
allow them to remove the limit when using trusted input.

For maintenance versions though, it would definitely need to be
possible to switch it off without touching the code.

Cheers,
Nick.

--

-- 
Nick Coghlan   |   ncoghlan <at> gmail.com   |   Brisbane, Australia
Stephen J. Turnbull | 14 Jan 09:05 2012

Re: Status of the fix for the hash collision vulnerability

Jack Diederich writes:
 > On Thu, Jan 12, 2012 at 9:57 PM, Guido van Rossum <guido <at> python.org> wrote:
 > > Hm... I started out as a big fan of the randomized hash, but thinking more
 > > about it, I actually believe that the chances of some legitimate app having
 > >>1000 collisions are way smaller than the chances that somebody's code will
 > > break due to the variable hashing.
 > 
 > Python's dicts are designed to avoid hash conflicts by resizing and
 > keeping the available slots bountiful.  1000 conflicts sounds like a
 > number that couldn't be hit accidentally

I may be missing something, but AIUI, with the resize, the search for
an unused slot after collision will be looking in a different series
of slots, so the N counter for the N^2 behavior resets on resize.  If
not, you can delete this message now.
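That intuition can be checked with a sketch of CPython-style perturbed probing (illustrative, not the exact implementation): the probe sequence starts at hash & mask, so changing the table size changes which hashes collide at all.

```python
def probe_indices(h, size, n=5):
    # First n slots visited for hash h in a table of `size` slots
    # (size must be a power of two); models perturbed probing.
    mask = size - 1
    i = h & mask
    perturb = h
    out = []
    for _ in range(n):
        out.append(i)
        perturb >>= 5
        i = (5 * i + 1 + perturb) & mask
    return out

# Hashes 1 and 9 collide in an 8-slot table but not in a 32-slot one:
assert probe_indices(1, 8)[0] == probe_indices(9, 8)[0]
assert probe_indices(1, 32)[0] != probe_indices(9, 32)[0]
```

Items that merely shared a slot under the small mask spread out after a resize; only keys whose full hashes are equal keep colliding, which is why the per-insert counter effectively restarts.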

If so, since (a) in the error-on-many-collisions approach we're adding
a test here for collision count anyway and (b) we think this is almost
never gonna happen, can't we defuse the exploit by just resizing the
dict after 1000 collisions, with strictly better performance than the
error approach, and almost current performance for "normal" input?

In order to prevent attackers from exploiting every 1000th collision
to force out-of-memory, the expansion factor for collision-induced
resizing could be "very small".  (I don't know if that's possible in
the Python dict implementation, if the algorithm requires something
like doubling the dict size on every resize this is right out, of
course.)

Or, since this is an error/rare path anyway, offer the user a choice
of an error or a resize on hitting 1000 collisions?
Frank Sievertsen | 13 Jan 09:11 2012

Re: Status of the fix for the hash collision vulnerability

Am 13.01.2012 02:24, schrieb Victor Stinner:
> My patch doesn't fix the DoS, it just makes the attack more complex.
> The attacker cannot pregenerate data for an attack: (s)he has first to
> compute the hash secret, and then compute hash collisions using the
> secret. The hash secret is at least 64 bits long (128 bits on a 64 bit
> system). So I hope that computing collisions requires a lot of CPU
> time (is slow) to make the attack ineffective with today's computers.
Unfortunately it requires only a few seconds to compute enough 32-bit 
collisions on one core with no precomputed data.  I'm sure it's possible 
to make this take less than a second.

In fact, since hash(X) == hash(Y) is independent of the suffix [ hash(X) 
^ suffix == hash(Y) ^ suffix ], a lot of precomputation (from the tail) 
is possible.
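As a sketch, here is a toy 16-bit version of the pre-randomization string hash; the reduced width makes a collision cheap to find by pigeonhole, and the suffix property is structural, so it holds at full width too (toy_hash and find_collision are illustrative names, not CPython code):

```python
from itertools import product
import string

MASK = 0xFFFF      # toy 16-bit width; the real hash used 32/64 bits
MULT = 1000003     # multiplier from the pre-randomization string hash

def toy_hash(s):
    """Reduced-width sketch of the old CPython string hash."""
    if not s:
        return 0
    h = (ord(s[0]) << 7) & MASK
    for c in s:
        h = ((h * MULT) ^ ord(c)) & MASK
    return (h ^ len(s)) & MASK

def find_collision():
    """26**4 four-letter strings > 2**16 hash values, so by
    pigeonhole two of them must collide."""
    seen = {}
    for tup in product(string.ascii_lowercase, repeat=4):
        s = ''.join(tup)
        h = toy_hash(s)
        if h in seen:
            return seen[h], s
        seen[h] = s

a, b = find_collision()
assert toy_hash(a) == toy_hash(b)
# Equal-length colliding strings have identical internal state, so any
# shared suffix keeps them colliding -- no per-suffix work required.
assert toy_hash(a + "anything") == toy_hash(b + "anything")
```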

So the question is: How difficult is it to guess the seed?

Frank
Victor Stinner | 13 Jan 10:23 2012

Re: Status of the fix for the hash collision vulnerability

> Unfortunately it requires only a few seconds to compute enough 32-bit
> collisions on one core with no precomputed data.

Are you running the hash function "backward" to generate strings with
the same value, or are you trying something more like brute force?

And how do you get the hash secret? You need it to run an attack.

> In fact, since hash(X) == hash(Y) is independent of the suffix [ hash(X) ^
> suffix == hash(Y) ^ suffix ], a lot of precomputation (from the tail) is
> possible.

My change also adds a prefix (i.e. there is both a prefix and a suffix).
I don't know if it changes anything for generating collisions.

> So the question is: How difficult is it to guess the seed?

I wrote some remarks about that in the issue. For example:

(hash("\0")^1) ^ (hash("\0\0")^2) gives ((prefix * 1000003) &
HASH_MASK) ^ ((prefix * 1000003**2)  & HASH_MASK)

I suppose that you don't directly have the full output of hash(str) in
practice, but hash(str) & DICT_MASK, where DICT_MASK is the size of the
internal dict array minus 1. For example, for a dictionary of 65,536
items, the mask is 0x1ffff and so cannot give you more than 17 bits of
the hash(str) output. I still don't know how difficult it is to
retrieve hash(str) bits from repr(dict).
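The masking can be sketched directly (h1 is a hypothetical hash value; only the figures from the example above are assumed):

```python
# Numbers from the example above: a dict internal table of 2**17 slots
# (enough for 65,536 items) masks hash values with 0x1ffff, so at most
# 17 bits of hash(key) are observable through slot placement.
DICT_MASK = 2**17 - 1              # 0x1ffff

h1 = 0x12345678                    # hypothetical full 32-bit hash value
h2 = h1 ^ (0b1011 << 20)           # differs only in bits above the mask

assert h1 != h2
assert (h1 & DICT_MASK) == (h2 & DICT_MASK)   # same observable slot index
```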

Victor
Frank Sievertsen | 13 Jan 12:49 2012

Re: Status of the fix for the hash collision vulnerability


>> Unfortunately it requires only a few seconds to compute enough 32-bit
>> collisions on one core with no precomputed data.
> Are you running the hash function "backward" to generate strings with
> the same value, or are you trying something more like brute force?

If you brute-force toward a specific target, you'll find only one good 
string every 4 billion tries. That's why you first blow up your target:

You start backward from an arbitrary target value. You brute-force 
backward for 3 characters, for example; this will give you 16 million 
intermediate values from which you know that they'll end up in your 
target value.

Those 16 million values now form a huge target for brute-forcing 
forward: every 256 tries you'll hit one of these values.
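At toy width the backward/forward split looks like this: 1000003 is odd and therefore invertible modulo a power of two, so each round can be undone. The sketch uses a 16-bit state and one-character tails instead of 32 bits and three characters; all names are illustrative:

```python
from itertools import product
import string

MASK = 0xFFFF
MULT = 1000003
MULT_INV = pow(MULT, -1, MASK + 1)   # 1000003 is odd => invertible mod 2**16
                                     # (three-argument pow needs Python 3.8+)

def state(s):
    """Internal hash state, before the final length XOR."""
    h = (ord(s[0]) << 7) & MASK
    for c in s:
        h = ((h * MULT) ^ ord(c)) & MASK
    return h

def step_back(h, c):
    """Undo one round h -> (h * MULT) ^ ord(c)."""
    return ((h ^ ord(c)) * MULT_INV) & MASK

target = state("collide!")           # arbitrary target value

# Backward phase: blow the single target up into one intermediate state
# per possible final character (the real attack walks back 3 characters).
tails = {step_back(target, c): c for c in string.ascii_lowercase}

# Forward phase: brute-force prefixes until one lands on an intermediate
# state; appending the recorded final character then reaches the target.
collisions = []
for tup in product(string.ascii_lowercase, repeat=4):
    prefix = ''.join(tup)
    if state(prefix) in tails:
        collisions.append(prefix + tails[state(prefix)])
        if len(collisions) >= 5:
            break

assert collisions and all(state(s) == target for s in collisions)
```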

> And how do you get the hash secret? You need it to run an attack.

I don't know. This was meant as an answer to the quoted text "So I hope 
that computing collisions requires a lot of CPU time (is slow) to make 
the attack ineffective with today computers.".

What I wanted to say is: The security relies on the fact that the 
attacker can't guess the prefix, not that he can't precompute the values 
and it takes hours or days to compute the collisions. If the prefix 
leaks out of the application, then the rest is trivial and done in a few 
seconds. The suffix is not important for the collision-prevention, but 
it will probably make it much harder to guess the prefix.

I don't know an effective way to get the prefix either, (if the 
application doesn't leak full hash(X) values).

Frank
Lennart Regebro | 13 Jan 12:20 2012

Re: Status of the fix for the hash collision vulnerability

On Fri, Jan 13, 2012 at 02:24, Victor Stinner
<victor.stinner <at> haypocalc.com> wrote:
> - Glenn Linderman proposes to fix the vulnerability by adding a new
> "safe" dict type (only accepting string keys). His proof-of-concept
> (SafeDict.py) uses a secret of 64 random bits and uses it to compute
> the hash of a key.

This is my preferred solution. The vulnerability is basically only in
the dictionary you keep the form data you get from a request. This
solves it easily and nicely. It can also be a separate module,
installable for Python 2 (which many web frameworks still use), so it
can practically be implemented now, and not in a couple of years.

Then again, nothing prevents us from having both this, *and* one of
the other solutions.  :-)

//Lennart
And Clover | 13 Jan 13:45 2012

Re: Status of the fix for the hash collision vulnerability

On 2012-01-13 11:20, Lennart Regebro wrote:
> The vulnerability is basically only in the dictionary you keep the
> form data you get from a request.

I'd have to disagree with this statement. The vulnerability is anywhere 
that creates a dictionary (or set) from attacker-provided keys. That 
would include HTTP headers, RFC822-family subheaders and parameters, the 
environ, input taken from JSON or XML, and so on - and indeed hash 
collision attacks are not at all web-specific.

The problem with having two dict implementations is that a caller would 
have to tell libraries that use dictionaries which implementation to 
use. So for example an argument would have to be passed to json.load[s] 
to specify whether the input was known-sane or potentially hostile.

Any library that could ever use dictionaries to process untrusted input 
*or any library that used another library that did* would have to pass 
such a flag through, which would quickly get very unwieldy indeed... or else 
they'd have to just always use safedict, in which case we're in pretty 
much the same position as we are with changing dict anyway.

-- 
And Clover
mailto:and <at> doxdesk.com
http://www.doxdesk.com/
gtalk:chat?jid=bobince <at> gmail.com
Victor Stinner | 14 Jan 02:35 2012

Re: Status of the fix for the hash collision vulnerability

> - Glenn Linderman proposes to fix the vulnerability by adding a new
> "safe" dict type (only accepting string keys). His proof-of-concept
> (SafeDict.py) uses a secret of 64 random bits and uses it to compute
> the hash of a key.

We could mix Marc's collision counter with the SafeDict idea (being able
to use a different secret for each dict): use hash(key, secret)
(simple example: hash(secret+key)) instead of hash(key) in dict (and
set), and change the secret if we have more than N collisions. But it
would slow down all dict lookups (dict creation, get, set, del, ...).
And getting new random data can also be slow.

SafeDict and hash(secret+key) lose the benefit of the cached hash
result. Because the hash result depends on an argument, we cannot cache
the result anymore, and we have to recompute the hash for each lookup
(even if you look up the same key twice or more).

Victor
Glenn Linderman | 14 Jan 03:09 2012

Re: Status of the fix for the hash collision vulnerability

On 1/13/2012 5:35 PM, Victor Stinner wrote:
>> - Glenn Linderman proposes to fix the vulnerability by adding a new
>> "safe" dict type (only accepting string keys). His proof-of-concept
>> (SafeDict.py) uses a secret of 64 random bits and uses it to compute
>> the hash of a key.
>
> We could mix Marc's collision counter with the SafeDict idea (being able
> to use a different secret for each dict): use hash(key, secret)
> (simple example: hash(secret+key)) instead of hash(key) in dict (and
> set), and change the secret if we have more than N collisions. But it
> would slow down all dict lookups (dict creation, get, set, del, ...).
> And getting new random data can also be slow.
>
> SafeDict and hash(secret+key) lose the benefit of the cached hash
> result. Because the hash result depends on an argument, we cannot cache
> the result anymore, and we have to recompute the hash for each lookup
> (even if you look up the same key twice or more).
>
> Victor

So integrating SafeDict into dict so it could be automatically converted would mean changing the data structures underneath dict.  Given that, a technique for hash caching could be created, that isn't quite as good as the one in place, but may be less expensive than not caching the hashes.  It would also take more space, a second dict, internally, as well as the secret.

So once the collision counter reaches some threshold (since there would be a functional fallback, it could be much lower than 1000), the secret is obtained, and the keys are rehashed using hash(secret+key).  Now when lookups occur, the object id of the key and the hash of the key are used as the index, and hash(secret+key) is stored as a cached value.  This would only benefit lookups by the same object; other objects with the same key value would be recalculated (at least the first time).  Some limit on the number of cached values would probably be appropriate.  This would add complexity, of course, in trying to save time.

An alternate solution would be to convert a dict to a tree once the number of collisions produces poor performance.  Converting to a tree would result in O(log N) instead of O(1) lookup performance, but that is better than the degenerate case of O(N) which is produced by the excessive number of collisions resulting from an attack.  This would require new tree code to be included in the core, of course, probably a red-black tree, which stays balanced.

In either of these cases, the conversion is expensive, because a collision threshold must first be reached to determine the need for conversion, so the hash could already contain lots of data.  If it were too expensive, the attack could still be effective.

Another solution would be to change the collision code, so that colliding keys don't produce O(N) behavior, but some other behavior.  Each colliding entry could convert that entry to a tree of entries, perhaps.  This would require no conversion of "bad dicts", and an attack could at worst convert O(1) performance to O(log N).
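A minimal sketch of that last idea, using a sorted list via bisect in place of a red-black tree and separate chaining in place of CPython's open addressing (so this is an illustration of the shape, not of the real dict; all names here are made up):

```python
import bisect

# Illustrative threshold; far below the 1000-collision limit discussed here.
TREEIFY_THRESHOLD = 8

class BucketMap:
    """Toy separate-chaining map.  A bucket starts as an unsorted list
    (O(n) scans) and falls back to a sorted list with O(log n) bisect
    lookups once its chain grows past TREEIFY_THRESHOLD.  A real
    implementation would use a balanced tree; keys must be orderable."""

    def __init__(self, nbuckets=8):
        # Each bucket is a pair (entries, is_sorted).
        self.buckets = [([], False) for _ in range(nbuckets)]

    def _index(self, key):
        return hash(key) % len(self.buckets)

    def __setitem__(self, key, value):
        i = self._index(key)
        entries, is_sorted = self.buckets[i]
        if is_sorted:
            j = bisect.bisect_left(entries, (key,))
            if j < len(entries) and entries[j][0] == key:
                entries[j] = (key, value)      # replace existing key
            else:
                entries.insert(j, (key, value))
            return
        for j, (k, _) in enumerate(entries):
            if k == key:
                entries[j] = (key, value)
                return
        entries.append((key, value))
        if len(entries) > TREEIFY_THRESHOLD:
            entries.sort()                     # "treeify" the long chain
            self.buckets[i] = (entries, True)

    def __getitem__(self, key):
        entries, is_sorted = self.buckets[self._index(key)]
        if is_sorted:
            j = bisect.bisect_left(entries, (key,))
            if j < len(entries) and entries[j][0] == key:
                return entries[j][1]
        else:
            for k, v in entries:
                if k == key:
                    return v
        raise KeyError(key)

# All multiples of 8 hash into bucket 0, so this chain "treeifies"
# quickly, yet lookups stay correct.
m = BucketMap()
for k in range(0, 200, 8):
    m[k] = k * 2
assert m[96] == 192
```

Lookups in a treeified bucket then cost O(log n) in the chain length, so an attacker can no longer force O(N) work per operation.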

Clearly these ideas are more complex than adding randomization, but adding randomization doesn't seem to produce immunity from attack when data about the randomness is leaked.
Gregory P. Smith | 14 Jan 03:25 2012

Re: Status of the fix for the hash collision vulnerability


> Clearly these ideas are more complex than adding randomization, but
> adding randomization doesn't seem to produce immunity from attack
> when data about the randomness is leaked.

Which will not normally happen.

I'm firmly in the camp that believes the random seed can be probed and
determined by creatively injecting values and measuring timing of
things.  But doing that is difficult and time- and bandwidth-intensive,
so the per-process random hash seed is good enough.

There's another elephant in the room here: if you want to avoid this
attack, use a 64-bit Python build, as it uses 64-bit hash values that
are significantly more difficult to force a collision on.

-gps
Gregory P. Smith | 14 Jan 03:34 2012

Re: Status of the fix for the hash collision vulnerability

btw, Tim's commit message on this one is amusingly relevant. :)

http://hg.python.org/cpython/diff/8d2bbbbf2cb9/Objects/dictobject.c

On Fri, Jan 13, 2012 at 6:25 PM, Gregory P. Smith <greg <at> krypto.org> wrote:
>> Clearly these ideas are more complex than adding randomization, but
>> adding randomization doesn't seem to produce immunity from attack
>> when data about the randomness is leaked.
>
> Which will not normally happen.
>
> I'm firmly in the camp that believes the random seed can be probed and
> determined by creatively injecting values and measuring timing of
> things.  But doing that is difficult and time and bandwidth intensive
> so the per process random hash seed is good enough.
>
> There's another elephant in the room here: if you want to avoid this
> attack, use a 64-bit Python build, as it uses 64-bit hash values that
> are significantly more difficult to force a collision on.
>
> -gps

Steven D'Aprano | 15 Jan 05:42 2012

Re: Status of the fix for the hash collision vulnerability

Victor Stinner wrote:

> - Marc Andre Lemburg proposes to fix the vulnerability directly in
> dict (for any key type). The patch raises an exception if a lookup
> causes more than 1000 collisions.

Am I missing something? How does this fix the vulnerability? It seems to me 
that the only thing this does is turn one sort of DoS attack into another sort 
of DoS attack: hostile users will just cause hash collisions until an 
exception is raised and the application falls over.

Catching these exceptions, and recovering from them (how?), would be the 
responsibility of the application author. Given that developers are unlikely 
to ever see 1000 collisions by accident, or even realise that it could happen, 
I don't expect that many people will do that -- until they personally get bitten.

-- 
Steven
Nick Coghlan | 15 Jan 06:11 2012

Re: Status of the fix for the hash collision vulnerability

On Sun, Jan 15, 2012 at 2:42 PM, Steven D'Aprano <steve <at> pearwood.info> wrote:
> Victor Stinner wrote:
>
>> - Marc Andre Lemburg proposes to fix the vulnerability directly in
>> dict (for any key type). The patch raises an exception if a lookup
>> causes more than 1000 collisions.
>
>
>
> Am I missing something? How does this fix the vulnerability? It seems to me
> that the only thing this does is turn one sort of DOS attack into another
> sort of DOS attack: hostile users will just cause hash collisions until an
> exception is raised and the application falls over.
>
> Catching these exceptions, and recovering from them (how?), would be the
> responsibility of the application author. Given that developers are unlikely
> to ever see 1000 collisions by accident, or even realise that it could
> happen, I don't expect that many people will do that -- until they
> personally get bitten.

As I understand it, the way the attack works is that a *single*
malicious request from the attacker can DoS the server by eating CPU
resources while evaluating a massive collision chain induced in a dict
by attacker supplied data. Explicitly truncating the collision chain
boots them out almost immediately (likely with a 500 response for an
internal server error), so they no longer affect other events, threads
and processes on the same machine.

In some ways, the idea is analogous to the way we implement explicit
recursion limiting in an attempt to avoid actually blowing the C stack
- we take a hard-to-detect-and-hard-to-handle situation (i.e. blowing
the C stack or malicious generation of long collision chains in a
dict) and replace it with something that is easy to detect and can be
handled by normal exception processing (i.e. a recursion depth
exception or one reporting an excessive number of slot collisions in a
dict lookup).

That then makes the default dict implementation safe from this kind of
attack by default, and use cases that are getting that many collisions
legitimately can be handled in one of two ways:
- switch to a more appropriate data type (if you're getting that many
collisions with benign data, a dict is probably the wrong container to
be using)
- offer a mechanism (command line switch or environment variable) to
turn the collision limiting off
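The collision-limit control flow can be sketched with a toy chained map (a deliberately tiny limit so the demo trips it; Marc's patch uses 1000 inside CPython's open-addressing probe loop, so this only illustrates the shape):

```python
# Marc's patch used a limit of 1000; a tiny limit here so the demo trips it.
MAX_COLLISIONS = 5

class LimitedDict:
    """Toy chained map that raises instead of walking a pathological
    collision chain (control flow only, not the real patch)."""

    def __init__(self, nbuckets=64):
        self.buckets = [[] for _ in range(nbuckets)]

    def __setitem__(self, key, value):
        chain = self.buckets[hash(key) % len(self.buckets)]
        for i, (k, _) in enumerate(chain):
            if k == key:
                chain[i] = (key, value)
                return
            if i + 1 >= MAX_COLLISIONS:
                # Bail out rather than degrade to O(n) per operation.
                raise RuntimeError("too many hash collisions")
        chain.append((key, value))

# Keys 0, 64, 128, ... all land in bucket 0; the sixth insert trips the limit.
d = LimitedDict()
blocked = False
try:
    for k in range(0, 640, 64):
        d[k] = k
except RuntimeError:
    blocked = True
assert blocked
```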

Now, where you can still potentially run into problems is if a single
shared dict is used to store both benign and malicious data - if the
malicious data makes it into the destination dict before the exception
finally gets triggered, and then benign data also happens to trigger
the same collision chain, then yes, the entire app may fall over.
However, such an app would have been crippled by the original DoS
anyway, since its performance would have been gutted - the collision
chain limiting just means it will trigger exceptions for the cases
that would have been insanely slow.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan <at> gmail.com   |   Brisbane, Australia
Paul McMillan | 16 Jan 23:23 2012

Re: Status of the fix for the hash collision vulnerability

> As I understand it, the way the attack works is that a *single*
> malicious request from the attacker can DoS the server by eating CPU
> resources while evaluating a massive collision chain induced in a dict
> by attacker supplied data. Explicitly truncating the collision chain
> boots them out almost immediately (likely with a 500 response for an
> internal server error), so they no longer affect other events, threads
> and processes on the same machine.

This is only true in the specific attack presented at 28c3. If an
attacker can insert data without triggering the attack, it's possible
to produce (in the example of a web application) URLs that (regardless
of the request) always produce pathological behavior. For example, a
collection of pathological usernames might make it impossible to list
users (and so choose which ones to delete) without resorting to
removing the problem data at an SQL level.

This is why the "simply throw an error" solution isn't a complete fix.
Making portions of an interface unusable for regular users is clearly
a bad thing, and is clearly applicable to other types of poisoned data
as well. We need to detect collisions and work around them
transparently.

> However, such an app would have been crippled by the original DoS
> anyway, since its performance would have been gutted - the collision
> chain limiting just means it will trigger exceptions for the cases
> that would have been insanely slow.

We can do better than saying "it would have been broken before, it's
broken differently now". The universal hash function idea has merit,
and for practical purposes hash randomization would fix this too
(since colliding data is only likely to collide within a single
process, persistent poisoning is far less feasible).

-Paul
Tim Delaney | 17 Jan 00:14 2012

Re: Status of the fix for the hash collision vulnerability

On 17 January 2012 09:23, Paul McMillan <paul <at> mcmillan.ws> wrote:
> This is why the "simply throw an error" solution isn't a complete fix.
> Making portions of an interface unusable for regular users is clearly
> a bad thing, and is clearly applicable to other types of poisoned data
> as well. We need to detect collisions and work around them
> transparently.

What if, in a pathological collision (e.g. > 1000 collisions), we increased the size of the dict by a small but random amount? This should be transparent, have negligible speed penalty, and maximal reuse of existing code, and should be very difficult to attack, since the dictionary would change size in a (near) non-deterministic manner when being attacked (i.e. the first attack causes a non-deterministic remap, so the next attack should fail).

It should also have near-zero effect on existing tests and frameworks since we would only get the non-deterministic behaviour in pathological cases, which we would presumably need new tests for.

Thoughts?

Tim Delaney 
Tim Delaney | 17 Jan 00:17 2012

Re: Status of the fix for the hash collision vulnerability



On 17 January 2012 10:14, Tim Delaney <timothy.c.delaney <at> gmail.com> wrote:
> On 17 January 2012 09:23, Paul McMillan <paul <at> mcmillan.ws> wrote:
>> This is why the "simply throw an error" solution isn't a complete fix.
>> Making portions of an interface unusable for regular users is clearly
>> a bad thing, and is clearly applicable to other types of poisoned data
>> as well. We need to detect collisions and work around them
>> transparently.
>
> What if, in a pathological collision (e.g. > 1000 collisions), we
> increased the size of the dict by a small but random amount? This
> should be transparent, have negligible speed penalty, and maximal
> reuse of existing code, and should be very difficult to attack, since
> the dictionary would change size in a (near) non-deterministic manner
> when being attacked (i.e. the first attack causes a non-deterministic
> remap, so the next attack should fail).
>
> It should also have near-zero effect on existing tests and frameworks
> since we would only get the non-deterministic behaviour in
> pathological cases, which we would presumably need new tests for.
>
> Thoughts?

And one thought I had immediately after hitting send is that there could be an attack of the form "build a huge dict, then hit it with something that causes it to rehash due to >1000 collisions". But that's not really going to be any worse than just building a huge dict and hitting a resize anyway.

Tim Delaney 
Victor Stinner | 17 Jan 01:16 2012

Re: Status of the fix for the hash collision vulnerability

2012/1/17 Tim Delaney <timothy.c.delaney <at> gmail.com>:
> What if in a pathological collision (e.g. > 1000 collisions), we increased
> the size of a dict by a small but random amount?

It doesn't change anything, you will still get collisions.

Victor
Guido van Rossum | 17 Jan 02:18 2012

Re: Status of the fix for the hash collision vulnerability

On Mon, Jan 16, 2012 at 4:16 PM, Victor Stinner <victor.stinner <at> haypocalc.com> wrote:
> 2012/1/17 Tim Delaney <timothy.c.delaney <at> gmail.com>:
>> What if in a pathological collision (e.g. > 1000 collisions), we increased
>> the size of a dict by a small but random amount?
>
> It doesn't change anything, you will still get collisions.

That depends, right? If the collision is because they all have the same hash(), yes. It might be different if it is because the secondary hashing (or whatever it's called :-) causes collisions.

--
--Guido van Rossum (python.org/~guido)
martin | 17 Jan 09:16 2012

Re: Status of the fix for the hash collision vulnerability

>> It doesn't change anything, you will still get collisions.
>
>
> That depends right? If the collision is because they all have the same
> hash(), yes. It might be different if it is because the secondary hashing
> (or whatever it's called :-) causes collisions.

But Python deals with the latter case just fine already. The open hashing
approach relies on the dict resizing "enough" to prevent collisions after
the dictionary has grown. Unless somebody can demonstrate a counter example,
I believe this discussion is a red herring.

Plus: if an attacker could craft keys that deliberately cause collisions
because of the dictionary size, they could likely also craft keys in the same
number that collide on actual hash values, bringing us back to the original
problem.

Regards,
Martin

Stephen J. Turnbull | 17 Jan 12:10 2012

Re: Status of the fix for the hash collision vulnerability

martin@v.loewis.de writes:
 > >> It doesn't change anything, you will still get collisions.
 > >
 > >
 > > That depends right? If the collision is because they all have the same
 > > hash(), yes. It might be different if it is because the secondary hashing
 > > (or whatever it's called :-) causes collisions.
 > 
 > But Python deals with the latter case just fine already. The open hashing
 > approach relies on the dict resizing "enough" to prevent collisions after
 > the dictionary has grown. Unless somebody can demonstrate a counter example,
 > I believe this discussion is a red herring.
 >
 > Plus: if an attacker could craft keys that deliberately cause collisions
 > because of the dictionary size, they could likely also craft keys in the same
 > number that collide on actual hash values, bringing us back to the original
 > problem.

I thought that the original problem was that with N insertions in the
dictionary, by repeatedly inserting different keys generating the same
hash value an attacker could arrange that the cost of finding an open
slot is O(N), and thus the cost of N insertions is O(N^2).

If so, frequent resizing could make the attacker's problem much more
difficult, as the distribution of secondary probes should change with
each resize.
Victor Stinner | 17 Jan 12:55 2012

Re: Status of the fix for the hash collision vulnerability

> I thought that the original problem was that with N insertions in the
> dictionary, by repeatedly inserting different keys generating the same
> hash value an attacker could arrange that the cost of finding an open
> slot is O(N), and thus the cost of N insertions is O(N^2).
>
> If so, frequent resizing could make the attacker's problem much more
> difficult, as the distribution of secondary probes should change with
> each resize.

The attack creates 60,000 strings (or more) with exactly the same hash
value. A dictionary uses hash(str) & DICT_MASK to compute the bucket
index, where DICT_MASK is the number of buckets minus one. If all
strings have the same hash value, we always start in the same bucket
and the key has to be compared to all previous strings to find the
next empty bucket. The attack works because a LOT of strings are
compared and comparing strings is slow.
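The attack Victor describes can be sketched in a few lines of Python. `CollidingKey` below is a toy stand-in for the attacker-crafted strings (a key type whose hash is constant, rather than actual colliding strings), and counting equality comparisons makes the quadratic cost visible without timing:

```python
class CollidingKey:
    """Toy stand-in for attacker-crafted strings: every instance hashes
    to the same value, so every insert walks the same probe chain. The
    class-level counter records how many (potentially slow) key
    comparisons the dict performs."""
    comparisons = 0

    def __init__(self, n):
        self.n = n

    def __hash__(self):
        return 42  # constant hash: the worst case described above

    def __eq__(self, other):
        CollidingKey.comparisons += 1
        return isinstance(other, CollidingKey) and self.n == other.n

def count_comparisons(n):
    """Insert n distinct colliding keys and return the number of key
    comparisons: the i-th insert walks past roughly i earlier keys,
    so the total grows quadratically with n."""
    CollidingKey.comparisons = 0
    d = {}
    for i in range(n):
        d[CollidingKey(i)] = i
    return CollidingKey.comparisons

# Doubling the number of keys roughly quadruples the work:
print(count_comparisons(500), count_comparisons(1000))
```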

If hash(str1)&DICT_MASK == hash(str2)&DICT_MASK but
hash(str1)!=hash(str2), strings are not compared (this is a common
optimization in Python), and so the attack would not be successful
(it would be slow, but not as slow as comparing two strings).
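The optimization Victor refers to can be sketched as follows; `bucket_lookup` is a hypothetical simplification of CPython's `lookdict` (the names and the flat entry list are mine), showing how the cheap stored-hash check runs before the potentially slow key comparison:

```python
def bucket_lookup(entries, key, key_hash):
    """Hypothetical simplification of CPython's lookup fast path:
    each entry stores (hash, key, value). The cheap integer hash
    comparison runs first; the potentially expensive key comparison
    runs only when the stored hash matches."""
    for stored_hash, stored_key, value in entries:
        if stored_hash != key_hash:
            continue  # hashes differ: keys cannot be equal, skip compare
        if stored_key is key or stored_key == key:
            return value
    raise KeyError(key)

# The entry with hash 1 is skipped without a string comparison:
entries = [(1, "a", "first"), (2, "bb", "second")]
print(bucket_lookup(entries, "bb", 2))
```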

Victor
Jeremy Sanders | 17 Jan 16:39 2012

Re: Status of the fix for the hash collision vulnerability

Victor Stinner wrote:

> If hash(str1)&DICT_MASK == hash(str2)&DICT_MASK but
> hash(str1)!=hash(str2), strings are not compared (this is a common
> optimization in Python), and so the attack would not be successful
> (it would be slow, but not as slow as comparing two strings).

It's a shame the hash function can't take a second salt parameter to include 
in the hash. Each dict could have its own salt, generated from a quick 
pseudo-random generator.

Jeremy

Jeremy Sanders | 17 Jan 16:44 2012

Re: Status of the fix for the hash collision vulnerability

Jeremy Sanders wrote:

> Victor Stinner wrote:
> 
>> If hash(str1)&DICT_MASK == hash(str2)&DICT_MASK but
>> hash(str1)!=hash(str2), strings are not compared (this is a common
>> optimization in Python), and so the attack would not be successful
>> (it would be slow, but not as slow as comparing two strings).
> 
> It's a shame the hash function can't take a second salt parameter to
> include in the hash. Each dict could have its own salt, generated from a
> quick pseudo-random generator.

Please ignore... forgot that the hashes are cached for strings!

Jeremy

"Martin v. Löwis" | 17 Jan 21:29 2012

Re: Status of the fix for the hash collision vulnerability

> I thought that the original problem was that with N insertions in the
> dictionary, by repeatedly inserting different keys generating the same
> hash value an attacker could arrange that the cost of finding an open
> slot is O(N), and thus the cost of N insertions is O(N^2).
> 
> If so, frequent resizing could make the attacker's problem much more
> difficult, as the distribution of secondary probes should change with
> each resize.

Not sure what you mean by "distribution of secondary probes".

Let H be the initial hash value, and let MASK be the current size
of the dictionary. Then I(n), the sequence of dictionary indices
being probed, is computed as

   I(0) = H & MASK
   PERTURB(0) = H
   I(n+1) = (5*I(n) + 1 + PERTURB(n)) & MASK
   PERTURB(n+1) = PERTURB(n) >> 5

So if two objects O1 and O2 have the same hash value H, the sequence of
probed indices is the same for any MASK value. It will be a different
sequence, yes, but they will still collide on each and every slot.

This is the very nature of open addressing. If it *wouldn't* try all
indices in the probe sequence, it may not be possible to perform
the lookup for a key correctly.
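Martin's recurrence can be reproduced directly (the function name is mine; the arithmetic follows the formulas above). The probe path depends only on H and MASK, so two keys with the same hash collide on every slot at any table size — a resize changes the sequence, but both keys still follow it in lockstep:

```python
def probe_sequence(h, mask, steps=8):
    """The open-addressing probe sequence given above: the path
    depends only on the hash H and the table MASK, so keys with
    equal hashes probe identical slots."""
    i = h & mask    # I(0) = H & MASK
    perturb = h     # PERTURB(0) = H
    seq = [i]
    for _ in range(steps - 1):
        i = (5 * i + 1 + perturb) & mask  # I(n+1)
        perturb >>= 5                     # PERTURB(n+1)
        seq.append(i)
    return seq

H = 0xDEADBEEF
# Different MASK values give different sequences, but any two keys
# sharing the hash H share the whole sequence:
print(probe_sequence(H, mask=7))
print(probe_sequence(H, mask=31))
```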

Regards,
Martin
Hrvoje Niksic | 18 Jan 11:15 2012

Re: Status of the fix for the hash collision vulnerability

On 01/17/2012 09:29 PM, "Martin v. Löwis" wrote:
>     I(0) = H & MASK
>     PERTURB(0) = H
>     I(n+1) = (5*I(n) + 1 + PERTURB(n)) & MASK
>     PERTURB(n+1) = PERTURB(n) >> 5
>
> So if two objects O1 and O2 have the same hash value H, the sequence of
> probed indices is the same for any MASK value. It will be a different
> sequence, yes, but they will still collide on each and every slot.
>
> This is the very nature of open addressing.

Open addressing can still deploy a collision resolution mechanism 
without this property. For example, double hashing uses a different hash 
function (applied to the key) to calculate PERTURB(0). To defeat it, the 
attacker would have to produce keys that hash the same using both hash 
functions.

Double hashing is not a good general solution for Python dicts because 
it complicates the interface of hash tables that support arbitrary keys. 
Still, it could be considered for dicts with known key types (built-ins 
could hardcode the alternative hash function) or for SafeDicts, if they 
are still considered.
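The double hashing Hrvoje describes can be sketched as follows (hypothetical illustration, not CPython code, and `h1`/`h2` below are made-up hash functions): the probe step is derived from a second, independent hash of the key, so an attacker must find keys that collide under both functions before the chain degenerates.

```python
def double_hash_probes(key, mask, h1, h2, steps=8):
    """Double hashing sketch: the initial slot comes from h1, but the
    probe step comes from a second, independent hash h2, so keys that
    collide under h1 still scatter unless they also collide under h2."""
    i = h1(key) & mask
    # The step is forced odd so it is coprime with the power-of-two
    # table size and the sequence eventually visits every slot.
    step = (h2(key) << 1) | 1
    seq = [i]
    for _ in range(steps - 1):
        i = (i + step) & mask
        seq.append(i)
    return seq

# Made-up hash functions for illustration: h1 is "broken" (the attacker
# forced a collision), h2 is a simple independent mix.
h1 = lambda key: 42
h2 = lambda key: sum(map(ord, key)) * 2654435761

# Same first slot, but the probe paths diverge immediately:
print(double_hash_probes("a", 31, h1, h2))
print(double_hash_probes("b", 31, h1, h2))
```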

Hrvoje
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev