Sunday, September 23, 2012

Exchange 2010 Randomly Losing Access to Active Directory

I had an issue at a customer site where a virtualised multi-role Exchange 2010 server was randomly losing access to Active Directory.  There were two Active Directory Domain Controllers with the Global Catalog role in the same Active Directory site as the Exchange 2010 server, with a high-speed 1 Gbps LAN between the servers.

When the issue occurred, Exchange 2010 would begin logging the generic errors you receive whenever no Active Directory domain controller is available.  Some of these errors include:

Log Name:      Application
Source:        MSExchange ADAccess
Date:          13/08/2012 8:58:37 AM
Event ID:      2114
Task Category: Topology
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      Exchange2010.domain.local
Description:
Process STORE.EXE (PID=3788). Topology discovery failed, error 0x80040952 (LDAP_LOCAL_ERROR (Client-side internal error or bad LDAP message)). Look up the Lightweight Directory Access Protocol (LDAP) error code specified in the event description. To do this, use Microsoft Knowledge Base article 218185, "Microsoft LDAP Error Codes." Use the information in that article to learn more about the cause and resolution to this error. Use the Ping or PathPing command-line tools to test network connectivity to local domain controllers.




Log Name:      Application
Source:        MSExchange ADAccess
Date:          13/08/2012 9:01:56 AM
Event ID:      2103
Task Category: Topology
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      Exchange2010.domain.local
Description:
Process MSEXCHANGEADTOPOLOGYSERVICE.EXE (PID=1468). All Global Catalog Servers in forest DC=internal,DC=domain,DC=com are not responding:
DC1.domain.local
DC2.domain.local



Log Name:      Application
Source:        MSExchange ADAccess
Date:          13/08/2012 9:04:56 AM
Event ID:      2604
Task Category: General
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      Exchange2010.domain.local
Description:
Process MSEXCHANGEADTOPOLOGY (PID=1468). When updating security for a remote procedure call (RPC) access for the Microsoft Exchange Active Directory Topology service, Exchange could not retrieve the security descriptor for Exchange server object Exchange2010 - Error code=80040934.
 The Microsoft Exchange Active Directory Topology service will continue starting with limited permissions.



Log Name:      Application
Source:        MSExchange ADAccess
Date:          13/08/2012 9:07:56 AM
Event ID:      2501
Task Category: General
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      Exchange2010.domain.local
Description:
Process MSEXCHANGEADTOPOLOGY (PID=1468). The site monitor API was unable to verify the site name for this Exchange computer - Call=HrSearch Error code=80040934. Make sure that Exchange server is correctly registered on the DNS server.
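
Event 2114 suggests testing network connectivity to the domain controllers with Ping or PathPing. As a complement, here is a minimal Python sketch that checks whether the LDAP (389) and Global Catalog (3268) TCP ports answer. The function names are my own for illustration, and a successful TCP connect only shows that the port is reachable, not that LDAP queries will succeed.

```python
import socket

# Standard Active Directory ports: LDAP (389) and Global Catalog (3268).
AD_PORTS = {"LDAP": 389, "Global Catalog": 3268}


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS resolution failures.
        return False


def check_domain_controllers(hosts, ports=AD_PORTS, timeout=3.0):
    """Map each (host, service name) pair to True/False reachability."""
    return {
        (host, name): port_open(host, port, timeout)
        for host in hosts
        for name, port in ports.items()
    }
```

For example, check_domain_controllers(["DC1.domain.local", "DC2.domain.local"]) returns one True/False entry per domain controller and port. Running it in a loop while the errors are being logged would show whether the problem is at the network level or inside Exchange.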


 
While the issue was occurring, I verified that the Exchange 2010 server could still locate and talk to a domain controller in the same Active Directory site by issuing the following command from a command prompt:
 
NLTEST /DSGETDC:domain.local
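
If you want to script this check, the following Python sketch parses the NLTEST output and compares the "Dc Site Name" and "Our Site Name" fields. The sample output is illustrative only (the address and site name are made up), and the function names are my own; treat it as a starting point rather than a finished tool.

```python
import subprocess

# Illustrative sample of NLTEST /DSGETDC output; address and site are made up.
SAMPLE_OUTPUT = r"""
           DC: \\DC1.domain.local
      Address: \\192.168.1.10
     Dom Name: domain.local
  Forest Name: domain.local
 Dc Site Name: Default-First-Site-Name
Our Site Name: Default-First-Site-Name
The command completed successfully
"""


def parse_nltest(output):
    """Parse 'Key: value' lines from NLTEST /DSGETDC output into a dict."""
    fields = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip():
            fields[key.strip()] = value.strip()
    return fields


def dc_in_same_site(fields):
    """Return True if the located DC is in the same AD site as this server."""
    dc_site = fields.get("Dc Site Name")
    our_site = fields.get("Our Site Name")
    return dc_site is not None and dc_site.lower() == (our_site or "").lower()


def query_dc(domain):
    """Run NLTEST on this machine (Windows only) and return the parsed fields."""
    result = subprocess.run(
        ["nltest", f"/DSGETDC:{domain}"],
        capture_output=True, text=True, check=True,
    )
    return parse_nltest(result.stdout)
```

On the Exchange server, dc_in_same_site(query_dc("domain.local")) should be True if the locator is returning a domain controller from the local site.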
 
So the server could still locate a domain controller; the problem was that the Exchange 2010 application itself was randomly losing access to Active Directory.
 
After further diagnosis, I made the following changes to the Windows TCP/IP network stack on the Exchange 2010 server:
 
netsh int tcp set global chimney=disabled
netsh int tcp set global rss=disabled
netsh int ip set global taskoffload=disabled
netsh int tcp set global autotuninglevel=disabled
 
 
This resolved the problem.

Only run these commands on your Exchange 2010 server if you are sure that there is an Active Directory Domain Controller in the same Active Directory site as your Exchange 2010 server and that the Exchange 2010 server is able to communicate with it.  Rule out all other possible causes first, such as network, storage, CPU or memory bottlenecks.
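
Because these are global changes, it is worth recording the current values first so they can be put back. The Python sketch below only builds the command strings; it does not run anything. The "assumed defaults" are my assumption of the stock Windows Server 2008 R2 values, so check them against the output of "netsh int tcp show global" and "netsh int ip show global" on your own server before relying on them. Note that taskoffload is set under the "netsh int ip" context rather than "netsh int tcp".

```python
# Commands to record the current state before changing anything.
SHOW_COMMANDS = [
    "netsh int tcp show global",
    "netsh int ip show global",
]

# Values applied in this post, keyed by (netsh context, setting name).
DISABLED = {
    ("tcp", "chimney"): "disabled",
    ("tcp", "rss"): "disabled",
    ("ip", "taskoffload"): "disabled",
    ("tcp", "autotuninglevel"): "disabled",
}

# ASSUMPTION: stock values on Windows Server 2008 R2. Verify them against
# the recorded "show global" output before using them to revert.
ASSUMED_DEFAULTS = {
    ("tcp", "chimney"): "automatic",
    ("tcp", "rss"): "enabled",
    ("ip", "taskoffload"): "enabled",
    ("tcp", "autotuninglevel"): "normal",
}


def build_commands(settings):
    """Turn a settings mapping into a list of netsh command lines."""
    return [
        f"netsh int {context} set global {name}={value}"
        for (context, name), value in settings.items()
    ]
```

build_commands(DISABLED) gives the four change commands and build_commands(ASSUMED_DEFAULTS) gives the matching revert commands, which can be pasted into an elevated command prompt.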

Hope this post has been helpful.

16 comments:

  1. Clint,

    Thanks for sharing your experience with this issue. It has been most helpful.

    Strangely I encountered a similar issue in my test environment (I was doing some testing on Active Directory Privilege Escalation) and was at a loss to figure out what was going wrong.

    I then tried the steps you suggested, and the problem seems to be resolved, at least for now.

    Do you happen to know what might be causing this issue? I'm not a big fan of tweaking my TCP stack frequently.

  2. Hi Nikolia,

    Windows Vista and later include changes to the network stack. In a few scenarios I have noticed latency around remote procedure calls such as SMB, WMI and calls to Active Directory. At another customer I had problems with PCs accessing file shares, which had a similar fix; please see the following blog post: http://clintboessen.blogspot.com.au/2012/04/windows-7-slow-access-to-network-shares.html.

    I do not recommend modifying the TCP stack lightly. I encourage administrators to examine all other areas first, such as CPU utilisation, memory, network congestion, disk I/O bottlenecks, application/system event logs and any network appliances that attempt to sniff, monitor or modify network packets in any way, before making the changes documented above.

    In both scenarios I have come across, this has been a last resort.

    Kind Regards,
    Clint

  3. Clint,

    This has been happening to our Exchange 2010 Server once a month for the past four months. The mail server is getting the same errors as you've posted.

    X2010 is running on an HP G5 dual Xeon with 16 GB memory and over 1 TB of RAID 5 storage. Memory usage averages 95%, and with only 225 mailboxes I am well within the recommended memory as per Microsoft documentation. CPU averages 5%, network utilization 3% or less.

    Aside from a full backup, is there any way to revert the TCP stack settings?

    Many thanks,

    Jimme

  4. Yes, simply use the same commands to re-enable the settings.

  5. Egg on my face. I see my question has an obvious answer. I'm researching the commands to understand what effect each will have. Thank you again!

    Jimme

  6. Hi Clint,

    We have a similar problem!
    2 DCs with Global Catalog, and Exchange servers in the same site.
    Where do you change these values?
    On the CAS, Mailbox or Hub servers, or on all Exchange servers?

    thanks in advance.

    Regards

    David

  7. Would this error cause Exchange to go offline?

  8. Noticed a correction:
    netsh int tcp set global taskoffload=disabled should be written as
    netsh int ip set global taskoffload=disabled

  9. I'm getting an error on the Exchange server saying the connection to a specific domain controller was lost due to a timeout.

    Will the above solution fix the problem?

  10. I'm getting an error on the Exchange server saying the connection to a specific domain controller was lost due to a timeout.

    Will the above solution fix the problem?

  11. Hi,

    I have the same problem with the same events over:

    - 1 AD Site
    - 2 DC Global Catalog
    - 4 Exchange CAS/HUB Servers
    - 6 Exchange Mailbox Servers

    All servers are VMware virtual machines in a vCenter farm with ESX 5.0 hosts.

    The communications department tells me they see packet loss when the domain controllers communicate with the Exchange servers.

  12. Hi All,
    Came across this when troubleshooting the same issue with the exact same event log entries. If you are experiencing the issue a few hours after raising the domain and forest functional levels, then either restart the Kerberos Key Distribution Center (KDC) service on all DCs or restart the DCs themselves. This appears to be relatively common after such a change. I found the KDC path after 4 hours of troubleshooting; hopefully this comment will help someone else who finds this solution first but has the same history as mine.
    Matt.

  13. Hi all, just want to share that after I installed the hotfix below on Windows Server 2008 R2 SP1, the error scenario above occurred. All of the settings had already been disabled, but our Exchange server still randomly loses connectivity to AD. I have just set them to disabled again and hope the problem goes away. Does anyone have other suggestions? How can we capture this and identify what actually causes the loss of connectivity?

    support.microsoft.com/en-us/kb/2545685

  14. Hi all, just an update. I have exactly the issue described above, and it remained after applying the settings above. So I decided to disable IPv6 completely: https://support.microsoft.com/en-us/kb/929852 . The problem is resolved!

  15. Hi all, replying to my message above: it seems that disabling IPv6 did not work for me. We opened a case with Microsoft and it turned out to be a Kerberos issue. Somehow all DCs suddenly no longer supported the Exchange server's encryption algorithm. You will see a KDC_ERR_ETYPE_NOSUPP error if you capture the traffic with Netmon.

    So the solution is to apply the settings below to the Exchange servers and to the AD servers that Exchange points to. Please remember to restart after applying the settings.

    In the GPO, go to Computer Configuration\Windows Settings\Security Settings\Local Policies\Security Options -> Network security: Configure encryption types allowed for Kerberos -> define the policy and select all of the encryption types.

    You can also configure this in local policy if you prefer; just remember to restart for the settings to take effect.
