Neutron L3 agent HA


neutron-l3-agent handles the L3 configuration in Neutron. Before discussing HA for neutron-l3-agent, we need to clarify the following first:

Suppose you create a VM and a router, and the VM works. If the neutron-l3-agent then goes down, what happens?
a) Existing VMs keep communicating, because the namespace and iptables rules are still present on the node.
b) If you create a new router and a new network, neutron-server binds the new router to another alive neutron-l3-agent, so it works.
c) If you boot a new VM on a network connected to a router that belongs to the dead neutron-l3-agent, the VM boots successfully and can communicate with the outside.
d) If you add, delete, or modify a network or floating IP on a router that belongs to the dead neutron-l3-agent, the change does not take effect, because no alive neutron-l3-agent is bound to that router to apply it.

 
 
So when neutron-l3-agent is down, existing VMs are not affected, and we can use a process supervisor such as monit to keep the "neutron-l3-agent" process alive.
But in Havana, the following bug causes a short network outage when neutron-l3-agent restarts:
https://bugs.launchpad.net/neutron/+bug/1175695
The outage happens because, on restart, neutron-l3-agent destroys all the namespaces and does a full sync with the DB. Destroying the namespaces is what interrupts traffic.
As a temporary fix in our environment, I manually commented out the following lines:

/usr/lib/python2.6/site-packages/neutron/agent/l3_agent.py
#        if self.conf.use_namespaces:
#            self._destroy_router_namespaces(self.conf.router_id)

With this change, restarting neutron-l3-agent no longer causes a network outage.
 
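The following script rebinds the routers of dead L3 agents to an alive one. Run it periodically (every 10 seconds) on a server that has neutronclient installed, but not on the controller nodes. Note that cron itself only schedules down to one minute, so a 10-second interval needs a wrapper loop around the script.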

#!/usr/bin/python
# Rebind routers from dead neutron-l3-agents to an alive one.
from __future__ import print_function

import sys

from neutronclient.v2_0 import client as neutronclient

TENANT_NAME = "admin"
USERNAME = "admin"
PASSWORD = "admin"
AUTH_URL = "https://10.224.159.107:443/v2.0/"

neutron = neutronclient.Client(auth_url=AUTH_URL,
                               username=USERNAME,
                               password=PASSWORD,
                               tenant_name=TENANT_NAME)

# Split the L3 agents by the liveness flag reported by neutron-server.
l3_agents = [agent for agent in neutron.list_agents()['agents']
             if agent['binary'] == 'neutron-l3-agent']
alive_l3_agents = [agent for agent in l3_agents if agent['alive']]
dead_l3_agents = [agent for agent in l3_agents if not agent['alive']]

if not dead_l3_agents:
    print("No dead L3")
    sys.exit(0)

if not alive_l3_agents:
    # Nothing to rebind to, so leave the routers where they are.
    print("No active L3")
    sys.exit(1)

for dead_l3_agent in dead_l3_agents:
    dead_routers = neutron.list_routers_on_l3_agent(dead_l3_agent['id'])
    for dead_router in dead_routers['routers']:
        neutron.remove_router_from_l3_agent(dead_l3_agent['id'], dead_router['id'])
        print("remove_router_from_l3_agent : L3 id is %s, router id is %s" % (dead_l3_agent['id'], dead_router['id']))
        # Currently, only add to the first alive agent
        neutron.add_router_to_l3_agent(alive_l3_agents[0]['id'], {"router_id": dead_router['id']})
        print("add_router_to_l3_agent : L3 id is %s, router id is %s" % (alive_l3_agents[0]['id'], dead_router['id']))
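
The script above moves every orphaned router to the first alive agent. One possible refinement, shown here only as a sketch, is to spread the routers over all alive agents in round-robin order. The helper name plan_rebinds and the sample IDs are hypothetical and not part of neutronclient; the function only computes the plan and leaves the actual remove/add calls to the script.

```python
from itertools import cycle


def plan_rebinds(dead_router_ids, alive_agent_ids):
    """Pair each orphaned router with an alive agent, round-robin.

    Hypothetical helper: it only builds the plan, it does not call Neutron.
    """
    if not alive_agent_ids:
        raise ValueError("no alive L3 agent to rebind to")
    # zip stops when the router list is exhausted; cycle repeats the agents.
    return list(zip(dead_router_ids, cycle(alive_agent_ids)))


# Made-up IDs for illustration only.
plan = plan_rebinds(["r1", "r2", "r3"], ["agent-a", "agent-b"])
for router_id, agent_id in plan:
    print("%s -> %s" % (router_id, agent_id))
```

Each (router, agent) pair from the plan could then be passed to add_router_to_l3_agent in place of alive_l3_agents[0]. Whether spreading helps depends on how loaded the remaining agents already are.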

If a physical node running neutron-l3-agent goes down, every router that belongs to that neutron-l3-agent is affected. We can use a monitoring tool to detect that the node is down and rebind its routers to another alive neutron-l3-agent. This causes a short network outage.
We can use the following CLI to rebind the router:

neutron l3-agent-router-remove $dead_l3_agent_id  $router_id
neutron l3-agent-router-add    $alive_l3_agent_id $router_id

This rebinds a router from a dead neutron-l3-agent to another alive neutron-l3-agent, at the cost of about 10 seconds of network outage. The more routers the dead neutron-l3-agent hosted, the longer the outage, because the receiving neutron-l3-agent configures routers one by one. For example, with 2 routers on the dead agent, router1 may see about 10 seconds of outage and router2 about 20 seconds.
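
The arithmetic above can be written down as a small sketch. It assumes the rough figure of 10 seconds per router quoted in the text and strictly sequential processing; the function name estimated_outages is made up for illustration, and real timings will vary with the environment.

```python
def estimated_outages(router_ids, seconds_per_router=10):
    """Return (router_id, estimated outage in seconds) in processing order.

    Assumes the agent configures routers one by one, so the i-th router
    waits for all routers before it plus its own configuration time.
    """
    return [(router_id, (index + 1) * seconds_per_router)
            for index, router_id in enumerate(router_ids)]


for router_id, seconds in estimated_outages(["router1", "router2"]):
    print("%s: about %d s" % (router_id, seconds))
```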
 
 
This is bad for applications running on the VMs, but when we deploy an application we can still reduce the impact of the outage as follows:


If the application sits behind a load balancer, we can deploy it on several VMs that are attached to different routers, with those routers bound to different neutron-l3-agents. Then even if one of the servers running neutron-l3-agent goes down, the application keeps working. After several seconds, the affected router is rebound to another alive neutron-l3-agent.
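
As a sketch of how such a deployment could be checked, the function below reports any L3 agent that hosts more than one of an application's routers, since that agent would be a shared point of failure. The function name shared_agents and the sample mapping are hypothetical; in practice the router-to-agent mapping would have to be collected from Neutron first.

```python
from collections import defaultdict


def shared_agents(router_to_agent, app_router_ids):
    """Return {agent_id: [router_ids]} for agents hosting 2+ of the app's routers."""
    routers_by_agent = defaultdict(list)
    for router_id in app_router_ids:
        routers_by_agent[router_to_agent[router_id]].append(router_id)
    return dict((agent_id, routers)
                for agent_id, routers in routers_by_agent.items()
                if len(routers) > 1)


# Made-up mapping for illustration only.
mapping = {"r1": "agent-a", "r2": "agent-b", "r3": "agent-a"}
print(shared_agents(mapping, ["r1", "r2"]))  # routers on distinct agents
print(shared_agents(mapping, ["r1", "r3"]))  # both routers on agent-a
```

An empty result means the application's routers are spread over distinct agents.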

