I figured out a method that works ~35% faster than your 6bits+Carmack+sqrt code, at least with my CPU (x86) and programming language (C/C++). Your results may vary, especially because I don't know how the Java factor will play out.
My approach is threefold:
First, filter out obvious answers. This includes negative numbers and looking at the last 4 bits. (I found looking at the last six didn't help.) I also answer yes for 0. (In reading the code below, note that my input is
if( x < 0 || (x&2) || ((x & 7) == 5) || ((x & 11) == 8) ) return false;
if( x == 0 ) return true;
Next, check if it's a square modulo 255 = 3 * 5 * 17. Because that's a product of three distinct primes, only about 1/8 of the residues mod 255 are squares. However, in my experience, calling the modulo operator (%) costs more than the benefit one gets, so I use bit tricks involving 255 = 2^8-1 to compute the residue. (For better or worse, I am not using the trick of reading individual bytes out of a word, only bitwise-and and shifts.)
int64 y = x;
y = (y & 4294967295LL) + (y >> 32); y = (y & 65535) + (y >> 16); y = (y & 255) + ((y >> 8) & 255) + (y >> 16); // At this point, y is between 0 and 511. More code can reduce it farther.