OpenNSL 3.5.0.1 report #76

Open
bluecmd opened this issue Oct 8, 2018 · 21 comments

@bluecmd

bluecmd commented Oct 8, 2018

Hi,

We're currently running FBOSS with a naively updated OpenNSL 3.5.0.1.

Since the crash mentioned in getdeps.sh is supposed to occur in opennsl_pkt_alloc, we verified the upgrade by sending LLDP:

V1008 19:57:34.172925 10343 BcmSwitch.cpp:1520] sendPacketOutOfPort for5
V1008 19:57:34.173009 10343 LldpManager.cpp:191] sent LLDP  on port 5 with CPU MAC 56:ab:3a:05:fc:0a port id XE5 and vlan 552
[..]
V1008 19:57:34.175858 10343 BcmSwitch.cpp:1520] sendPacketOutOfPort for61
V1008 19:57:34.175936 10343 LldpManager.cpp:191] sent LLDP  on port 61 with CPU MAC 56:ab:3a:05:fc:0a port id XE61 and vlan 552
V1008 19:57:34.176015 10343 BcmSwitch.cpp:1520] sendPacketOutOfPort for62
V1008 19:57:34.176088 10343 LldpManager.cpp:191] sent LLDP  on port 62 with CPU MAC 56:ab:3a:05:fc:0a port id XE62 and vlan 552
V1008 19:57:34.176190 10343 LldpManager.cpp:90] Skipping LLDP send as this port is disabled 63
V1008 19:57:34.176255 10343 LldpManager.cpp:90] Skipping LLDP send as this port is disabled 64

No crash was observed.

OpenNSL 3.5.0.1 allows the use of modern kernel drivers and lets you set the OpenNSL BCM configuration, so upgrading to it would probably be interesting for a lot of folks.

@shri-khare
Contributor

Thank you for reporting this.

Could you please share more details about the crash dump, for example the stack trace?
Also, which SDK was the FBOSS agent built against when the crash was observed?
The post mentions a 'reported crash in getdeps.sh'. Which crash is that? Could you please share more details about it?

@bluecmd
Author

bluecmd commented Oct 9, 2018

Hi,

I'm reporting the absence of a crash, and recommending that you either reconsider or provide more details about the crash referred to here:

# SIGSEV in opennsl_pkt_alloc()

Essentially "It works for us".

@capveg

capveg commented Oct 9, 2018

hi @bluecmd,

Are you sure you didn't have to do anything else to get opennsl 3.5.0.1 working? I can definitely believe that the crash in opennsl_pkt_alloc() has been fixed but there were a number of other changes that were required to get opennsl 3.5.0.1 working -- trivially, opennsl_driver_init()'s prototype changed -- see my changes here to at least get it to compile: #65

And even after it compiled, it was my experience that all of the packet forwarding was broken because the initialization process was quite different.

If you have it working, we'd definitely appreciate understanding how, because if we could update to OpenNSL 3.5.0.1, we could unlock a bunch of previously unreleased changes (e.g., ACLs) that depend on newer versions.

Please confirm and let us know - thanks as always for the interest!

@bluecmd
Author

bluecmd commented Oct 9, 2018

Hi @capveg. I admit it's a bit sneaky, but if you click on "OpenNSL 3.5.0.1" in my report you get the diff of the patch, and you'll see the actual code changes that we made.

Since we have Wedges graciously donated from FB running with ONL + FBOSS we're more than happy to help you collect any data that you need to debug any issues, but as far as we've seen It Just Works(TM) with the somewhat trivial patch of essentially only changing the opennsl_driver_init call.

EDIT: Direct link to what I'm talking about here: https://github.com/dhtech/fboss/pull/4/files#diff-941e4fb204c29b957373093d97373880
EDIT x2: And we also needed to specify OPENNSL_CONFIG_FILE=/etc/config.wedge40 as the environment of course.
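For anyone scripting the launch, here is a minimal Python sketch of passing that variable to the agent. The binary path is the one that shows up in the `ldd` output later in this thread; the `agent_env` and `run_agent` helpers are hypothetical names, and no agent flags are assumed.

```python
import os
import subprocess


def agent_env(config_file="/etc/config.wedge40", base=None):
    """Return a copy of the environment with OPENNSL_CONFIG_FILE set."""
    env = dict(os.environ if base is None else base)
    env["OPENNSL_CONFIG_FILE"] = config_file
    return env


def run_agent(binary="/usr/local/bin/wedge_agent", extra_args=()):
    """Start the agent with the adjusted environment (flags are left to the caller)."""
    return subprocess.run([binary, *extra_args], env=agent_env(), check=False)


if __name__ == "__main__":
    # Only show the variable; run_agent() is not called here.
    print(agent_env(base={})["OPENNSL_CONFIG_FILE"])
```

Copying the environment rather than mutating `os.environ` keeps the setting scoped to the child process.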

@capveg

capveg commented Oct 9, 2018

Hmm... so your patch looks effectively identical to my patch... so I'm wondering why yours works. I saw in one of the comments there a "Status: not working" - can you clarify? Just because the FBOSS agent logs "sending lldp to X" doesn't necessarily mean it's happening. Are you seeing that packet received on the other side? Sorry if this seems pedantic - but we've been (admittedly, slowly) debugging this for a while...

@bluecmd
Author

bluecmd commented Oct 9, 2018

The "Status: not working" refers to my quest to downrate the SerDes to support 1G line rate (https://github.com/Broadcom-Switch/OpenNSL/issues/37).

No worries, I also wouldn't trust strangers on the internet.
What I can give you in terms of proof is the neighbouring Cisco switch receiving the LLDP frames and accepting them:

Switch#show lldp neighbors
Capability codes:
    (R) Router, (B) Bridge, (T) Telephone, (C) DOCSIS Cable Device
    (W) WLAN Access Point, (P) Repeater, (S) Station, (O) Other

Device ID           Local Intf     Hold-time  Capability      Port ID
wedge1              Te1/0/2        120        R               XE5

Total entries displayed: 1

@bluecmd
Author

bluecmd commented Oct 9, 2018

Just a random question, did you upgrade the kernel modules associated with OpenNSL when running a newer OpenNSL? We're running pretty brand new kernel modules (I think we're even using 3.5.0.1 kernel modules) even for 6.4, as those are the only ones available for us. It required some hacks to get to compile, but it worked well enough. Maybe the fact that we're running newer kernel drivers is the missing puzzle piece?

EDIT:

dhtech@wedge1:~$ strings /lib/modules/4.14.48-OpenNetworkLinux/linux-kernel-bde.ko | grep OpenNSL | head -n1
/home/bluecmd/OpenNSL/sdk-6.5.12-gpl-modules/include/sal

@capveg

capveg commented Oct 9, 2018

Thanks for all the info.

The kernel API is fairly stable so I'm not surprised that the 3.5.0.1 kernel modules work for older versions of OpenNSL. I wouldn't run that way long term (it's definitely not a tested setup :-), but not surprised it works. We run a fairly new kernel internally... let me confirm some details with some other folks and see if we can come up with a theory.

In any case, glad to hear this is working for you.

@bluecmd
Author

bluecmd commented Oct 9, 2018

Just to add more data to keep myself honest:

dhtech@wedge1:~$ ldd /usr/local/bin/wedge_agent  | grep libopennsl
        libopennsl.so.1 => /usr/local/lib/libopennsl.so.1 (0x00007fe4583cc000)
dhtech@wedge1:~$ sudo find / -name libopennsl.so.1 | xargs sha1sum
c5a00a16bb0e0be3d557a6e21bc1ee43aa06d4c2  /usr/local/lib/libopennsl.so.1

That matches with the Dec-27 release that's current in https://github.com/Broadcom-Switch/OpenNSL/tree/master/bin/wedge. So I'm pretty sure I'm not messing up the versioning on my end.
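The same check can be scripted. A minimal Python sketch, assuming the library path and digest shown above; `sha1_of` is a helper name made up for illustration.

```python
import hashlib
from pathlib import Path

# Digest reported above for /usr/local/lib/libopennsl.so.1.
EXPECTED_SHA1 = "c5a00a16bb0e0be3d557a6e21bc1ee43aa06d4c2"


def sha1_of(path, chunk_size=1 << 20):
    """Return the hex SHA-1 of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    lib = Path("/usr/local/lib/libopennsl.so.1")
    if lib.exists():
        actual = sha1_of(lib)
        print("match" if actual == EXPECTED_SHA1 else f"mismatch: {actual}")
    else:
        print(f"{lib} not found")
```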

@sonoble
Contributor

sonoble commented Oct 10, 2018

> Just a random question, did you upgrade the kernel modules associated with OpenNSL when running a newer OpenNSL? We're running pretty brand new kernel modules (I think we're even using 3.5.0.1 kernel modules) even for 6.4, as those are the only ones available for us. It required some hacks to get to compile, but it worked well enough. Maybe the fact that we're running newer kernel drivers is the missing puzzle piece?

Can you provide any more information about the hacks? Compiling OpenNSL 3.5.0.1 for the 4.14 kernel I have fixed pci_enable_msix, copy_to/from_user and dev->trans_start = jiffies; but FBOSS is still having issues:

I1010 20:50:13.945568 4058 BcmSwitch.cpp:560] Initializing BcmSwitch for unit 0
*** Aborted at 1539204614 (unix time) try "date -d @1539204614" if you are using GNU date ***
PC: @ 0x560640774da2 std::unique_ptr<>::get()

@bluecmd
Author

bluecmd commented Oct 11, 2018

It was a while since I hacked together the kernel modules, but dhtech/OpenNSL@3e5a8af + ONL 9 should be what we're running.

A notable thing is that we do not load the knet driver. I seem to recall that it was a crash inside OpenNSL 6.4 when running FBOSS with the knet driver loaded. I have not tried that driver with 3.5.0.1.

@bluecmd
Author

bluecmd commented Oct 11, 2018

Ah @sonoble, looking at the last line of your report you're probably hitting #74. Not sure without the full stack trace however.

You can try using our fork that is using FBOSS from May with some patches applied: https://github.com/dhtech/fboss if you need it up and running right now.

@sonoble
Contributor

sonoble commented Oct 11, 2018

> It was a while since I hacked together the kernel modules, but dhtech/OpenNSL@3e5a8af + ONL 9 should be what we're running.
>
> A notable thing is that we do not load the knet driver. I seem to recall that it was a crash inside OpenNSL 6.4 when running FBOSS with the knet driver loaded. I have not tried that driver with 3.5.0.1.

No one runs knet that I know of. Looks like your changes are the same as mine. I build the entire OpenNSL from source, so I just set the KERNEL_SRC and LINUX_UAPI_SPLIT="1".

I don't need FBOSS running right now, I was just trying to confirm that your patch worked for me on the 40's. I have been working on getting everything working on the 100S but in a totally different way, by removing the init from OpenNSL and having FBOSS handle it.

I will build your fboss and see if I can get it working.

Thank you!

@sonoble
Contributor

sonoble commented Oct 11, 2018

I built your fboss + the modified OpenNSL and while everything is running, there are no interfaces at all using your config or mine. I will dig more into it later.

@sonoble
Contributor

sonoble commented Oct 12, 2018

@bluecmd I don't see it in this thread: have you been able to confirm that packets other than LLDP are passing? We have seen LLDP packets before but were unable to ping or send any other traffic between boxes.

@bluecmd
Author

bluecmd commented Oct 13, 2018

Only LLDP so far, as well as normal L2 switching.

@sonoble
Contributor

sonoble commented Oct 16, 2018

Hi @bluecmd, I am able to confirm L2 and LLDP on the Wedge 100S but no L3 (packets are not making it to the CPU), so no routing protocols can be run. Can you check whether you can ping a port after assigning an IP to it?
Thank you!

@bluecmd
Author

bluecmd commented Oct 16, 2018

@sonoble Sure. Do you have any configuration to share to make the time commitment shorter on my part? Also, did this work on 6.4? We only use L2 stuff so I'm not very aware of the state of L3 in FBOSS.

@sonoble
Contributor

sonoble commented Oct 16, 2018 via email

@bluecmd
Author

bluecmd commented Oct 17, 2018

So these are my observations, with 3.5.0.1 and our FBOSS fork from May/June. We have never tried running this with the old FBOSS, so I have no idea whether this is a regression, but here they are as requested by @sonoble.

I added an L3 interface like this:

    "interfaces": [
        {
              "intfID": 10,
              "routerID": 0,
              "vlanID": 552,
              "ipAddresses": [
                    "10.32.12.250/24"
              ]
        }
    ]
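As a sanity check on the addressing, here is a small Python sketch using the standard `ipaddress` module. It only inspects the fragment above (wrapped in braces so it parses as JSON) and assumes nothing about how FBOSS itself reads the config; `on_link` is a hypothetical helper. The two peers are the addresses that appear in the logs further down.

```python
import ipaddress
import json

# The fragment from the comment, wrapped in braces so it parses as JSON.
CONFIG = json.loads("""
{
    "interfaces": [
        {
            "intfID": 10,
            "routerID": 0,
            "vlanID": 552,
            "ipAddresses": ["10.32.12.250/24"]
        }
    ]
}
""")


def on_link(config, peer):
    """Return (intfID, vlanID) pairs whose subnet contains `peer`."""
    peer_ip = ipaddress.ip_address(peer)
    hits = []
    for intf in config["interfaces"]:
        for addr in intf["ipAddresses"]:
            if peer_ip in ipaddress.ip_interface(addr).network:
                hits.append((intf["intfID"], intf["vlanID"]))
    return hits


print(on_link(CONFIG, "10.32.12.1"))    # covered by interface 10 on vlan 552
print(on_link(CONFIG, "77.80.231.34"))  # not covered by this interface
```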

This configured an fboss10 interface that does see the incoming packets:

I1017 21:55:00.147141 10750 FunctionScheduler.cpp:505] Now running updateStats
V1017 21:55:00.920305 10744 UnresolvedNhopsProber.cpp:53]  Sending probe for unresolved next hop: 10.32.12.250
V1017 21:55:00.920405 10744 ArpHandler.cpp:153] sending ARP request on vlan 552 to 10.32.12.250 (ff:ff:ff:ff:ff:ff): 10.32.12.250 is 56:ab:3a:05:fc:0a
I1017 21:55:01.147235 10750 FunctionScheduler.cpp:505] Now running updateStats
V1017 21:55:01.245810 10746 ArpHandler.cpp:153] sending ARP request on vlan 552 to 10.32.12.1 (ff:ff:ff:ff:ff:ff): 10.32.12.250 is 56:ab:3a:05:fc:0a
V1017 21:55:01.246166 10746 TunIntf.cpp:349] Forwarded 1 packets (84 bytes) from host @ fd 62 for interface fboss10 dropped:0
I1017 21:55:01.246147 10745 SwSwitch.cpp:787] preparing state update add pending entry 10.32.12.1
V1017 21:55:01.246336 10745 NeighborCacheImpl-defs.h:137] Adding pending entry for 10.32.12.1 on interface 10
I1017 21:55:01.246403 10745 SwSwitch.cpp:921] Updating state: old_gen=6 new_gen=7
V1017 21:55:01.246495 10745 BcmSwitch.cpp:1048] updating VLAN 552: 0 ports added, 0 ports removed
V1017 21:55:01.246592 10745 BcmHost.cpp:394] created BcmHost: 10.32.12.1@vrf0. new ref count: 1
V1017 21:55:01.246645 10745 BcmSwitch.cpp:1259] adding pending neighbor entry to 10.32.12.1
V1017 21:55:01.246701 10745 BcmHost.cpp:149] Host entry for BcmHost: 10.32.12.1@vrf0 does not have an egress, create one.
V1017 21:55:01.246842 10745 BcmEgress.cpp:145] programmed L3 egress object 100005 for to CPU on unit 0 for ip: 10.32.12.1 @ brcmif 0 flags 8392704 towards port 0
V1017 21:55:01.246900 10745 BcmHost.cpp:594] insert egress 100005 into egress map
V1017 21:55:01.246962 10745 BcmHost.cpp:131] Adding host entry for : 10.32.12.1
V1017 21:55:01.247110 10745 BcmHost.cpp:135] created L3 host object for BcmHost: 10.32.12.1@vrf0 @egress 100005
V1017 21:55:01.247167 10745 BcmHost.cpp:177] Updating egress 100005 from physical port 0 to physical port 0
V1017 21:55:01.247382 10748 QsfpCache.cpp:101] All 64 ports up to date
V1017 21:55:01.247386 10745 SwSwitch.cpp:970] Update state took 981us
I1017 21:55:02.147334 10750 FunctionScheduler.cpp:505] Now running updateStats
V1017 21:55:02.247273 10744 ArpHandler.cpp:153] sending ARP request on vlan 552 to 10.32.12.1 (ff:ff:ff:ff:ff:ff): 10.32.12.250 is 56:ab:3a:05:fc:0a
V1017 21:55:02.256384 10746 IPv4Handler.cpp:275] not sending arp for 10.32.12.1, pending entry already exists
V1017 21:55:02.256596 10746 TunIntf.cpp:349] Forwarded 1 packets (84 bytes) from host @ fd 62 for interface fboss10 dropped:0
I1017 21:55:03.147433 10750 FunctionScheduler.cpp:505] Now running updateStats
V1017 21:55:03.247602 10744 ArpHandler.cpp:153] sending ARP request on vlan 552 to 10.32.12.1 (ff:ff:ff:ff:ff:ff): 10.32.12.250 is 56:ab:3a:05:fc:0a
V1017 21:55:03.280399 10746 IPv4Handler.cpp:275] not sending arp for 10.32.12.1, pending entry already exists
V1017 21:55:03.280599 10746 TunIntf.cpp:349] Forwarded 1 packets (84 bytes) from host @ fd 62 for interface fboss10 dropped:0
I1017 21:55:04.147526 10750 FunctionScheduler.cpp:505] Now running updateStats
V1017 21:55:04.247937 10744 ArpHandler.cpp:153] sending ARP request on vlan 552 to 10.32.12.1 (ff:ff:ff:ff:ff:ff): 10.32.12.250 is 56:ab:3a:05:fc:0a
V1017 21:55:04.281910 10746 IPv4Handler.cpp:275] not sending arp for 10.32.12.1, pending entry already exists
V1017 21:55:04.282104 10746 TunIntf.cpp:349] Forwarded 1 packets (84 bytes) from host @ fd 62 for interface fboss10 dropped:0
I1017 21:55:05.147614 10750 FunctionScheduler.cpp:505] Now running updateStats
V1017 21:55:05.248259 10744 ArpHandler.cpp:153] sending ARP request on vlan 552 to 10.32.12.1 (ff:ff:ff:ff:ff:ff): 10.32.12.250 is 56:ab:3a:05:fc:0a
V1017 21:55:05.296386 10746 IPv4Handler.cpp:275] not sending arp for 10.32.12.1, pending entry already exists
V1017 21:55:05.296606 10746 TunIntf.cpp:349] Forwarded 1 packets (84 bytes) from host @ fd 62 for interface fboss10 dropped:0
V1017 21:55:05.921184 10744 UnresolvedNhopsProber.cpp:53]  Sending probe for unresolved next hop: 10.32.12.250
V1017 21:55:05.921291 10744 ArpHandler.cpp:153] sending ARP request on vlan 552 to 10.32.12.250 (ff:ff:ff:ff:ff:ff): 10.32.12.250 is 56:ab:3a:05:fc:0a
I1017 21:55:06.147709 10750 FunctionScheduler.cpp:505] Now running updateStats
V1017 21:55:06.248940 10744 ArpHandler.cpp:153] sending ARP request on vlan 552 to 10.32.12.1 (ff:ff:ff:ff:ff:ff): 10.32.12.250 is 56:ab:3a:05:fc:0a
V1017 21:55:06.320411 10746 IPv4Handler.cpp:275] not sending arp for 10.32.12.1, pending entry already exists
V1017 21:55:06.320631 10746 TunIntf.cpp:349] Forwarded 1 packets (84 bytes) from host @ fd 62 for interface fboss10 dropped:0

ICMP replies are also sent (looking at tcpdump fboss10) but they never arrive at the pinger.
The IP above is on the same subnet as the management, so there are some ARP shortcuts that can be taken there.

Using another IP address that is on its own subnet makes things break earlier. The fboss10 interface still shows some random IPv6 traffic that it captures, so packet capture works - however not much more than that.

FBOSS output:

V1017 22:09:51.115411 10986 LldpManager.cpp:90] Skipping LLDP send as this port is disabled 59
V1017 22:09:51.115458 10986 LldpManager.cpp:90] Skipping LLDP send as this port is disabled 60
V1017 22:09:51.115563 10986 BcmSwitch.cpp:1520] sendPacketOutOfPort for61
V1017 22:09:51.115658 10986 LldpManager.cpp:191] sent LLDP  on port 61 with CPU MAC 56:ab:3a:05:fc:0a port id XE61 and vlan 552
V1017 22:09:51.115746 10986 BcmSwitch.cpp:1520] sendPacketOutOfPort for62
V1017 22:09:51.115817 10986 LldpManager.cpp:191] sent LLDP  on port 62 with CPU MAC 56:ab:3a:05:fc:0a port id XE62 and vlan 552
V1017 22:09:51.115864 10986 LldpManager.cpp:90] Skipping LLDP send as this port is disabled 63
V1017 22:09:51.115912 10986 LldpManager.cpp:90] Skipping LLDP send as this port is disabled 64
I1017 22:09:52.114615 10992 FunctionScheduler.cpp:505] Now running updateStats
I1017 22:09:53.114711 10992 FunctionScheduler.cpp:505] Now running updateStats
V1017 22:09:53.225328 10986 UnresolvedNhopsProber.cpp:53]  Sending probe for unresolved next hop: 77.80.231.34
V1017 22:09:53.225431 10986 ArpHandler.cpp:153] sending ARP request on vlan 922 to 77.80.231.34 (ff:ff:ff:ff:ff:ff): 77.80.231.34 is 56:ab:3a:05:fc:0a
I1017 22:09:54.114811 10992 FunctionScheduler.cpp:505] Now running updateStats
I1017 22:09:55.114902 10992 FunctionScheduler.cpp:505] Now running updateStats
I1017 22:09:56.115000 10992 FunctionScheduler.cpp:505] Now running updateStats

Notice that it is sending out an ARP broadcast but never logs a "sendPacketOutOfPort" message. Following the code, this is because this path calls "sendPacketSwitched". See here.

Maybe sendPacketSwitched is broken while sendPacketOutOfPort works?
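To check that observation against a longer capture rather than by eye, here is a Python sketch that counts the relevant message types. The regular expression assumes the glog-style prefix seen in the excerpts above (level letter plus MMDD, time, thread id, `file:line]`); `summarize` is a hypothetical helper, not part of FBOSS, and the sample lines are copied from the output above.

```python
import re
from collections import Counter

# glog-style prefix as seen in the excerpts: level+MMDD, time, thread id, file:line]
LINE_RE = re.compile(
    r"^(?P<level>[A-Z])(?P<date>\d{4}) (?P<time>[\d:.]+)\s+(?P<tid>\d+) "
    r"(?P<file>[\w.\-]+):(?P<line>\d+)\] (?P<msg>.*)$"
)

SAMPLE = """\
V1017 22:09:51.115746 10986 BcmSwitch.cpp:1520] sendPacketOutOfPort for62
V1017 22:09:51.115817 10986 LldpManager.cpp:191] sent LLDP  on port 62 with CPU MAC 56:ab:3a:05:fc:0a port id XE62 and vlan 552
V1017 22:09:53.225431 10986 ArpHandler.cpp:153] sending ARP request on vlan 922 to 77.80.231.34 (ff:ff:ff:ff:ff:ff): 77.80.231.34 is 56:ab:3a:05:fc:0a
"""


def summarize(log_text):
    """Count log messages by coarse category."""
    counts = Counter()
    for raw in log_text.splitlines():
        m = LINE_RE.match(raw)
        if not m:
            continue  # skip lines that do not look like glog output
        msg = m.group("msg")
        if msg.startswith("sendPacketOutOfPort"):
            counts["out_of_port"] += 1
        elif msg.startswith("sent LLDP"):
            counts["lldp"] += 1
        elif msg.startswith("sending ARP request"):
            counts["arp"] += 1
    return counts


print(summarize(SAMPLE))
```

If ARP sends keep increasing while `out_of_port` only tracks the LLDP count, that is consistent with the ARP path going through a different TX call.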

Next steps to confirm that could be:

  1. Test with OpenNSL 6.4 to make sure this used to work
  2. Add logging to the sendPacketSwitched call to see if it fails
  3. Check whether the OpenNSL documentation on how to use switched TX matches what is done here for 3.5.0.1.

EDIT: I have a hypothesis that this might also be related to L1 errors. I'll debug a bit and update.

@bluecmd
Author

bluecmd commented Oct 19, 2018

Update: Yes, it was an L1 error. Having fixed the cabling I can now see packets egressing as well. Ping doesn't work, but that is most likely FBOSS related.

I1019 12:05:26.602193  3250 FunctionScheduler.cpp:505] Now running updateStats
V1019 12:05:26.783669  3244 UnresolvedNhopsProber.cpp:53]  Sending probe for unresolved next hop: 77.80.231.34
V1019 12:05:26.783776  3244 ArpHandler.cpp:153] sending ARP request on vlan 922 to 77.80.231.34 (ff:ff:ff:ff:ff:ff): 77.80.231.34 is 56:ab:3a:05:fc:0a

tcpdump on computer:

14:05:26.681010 ARP, Request who-has 77.80.231.34 tell 77.80.231.34, length 50
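For reference, the tcpdump line maps onto the ARP header fields as sketched below. This Python snippet packs and unpacks a 28-byte Ethernet/IPv4 ARP request with the addresses from the log; it does not send anything. The all-zero target hardware address is an assumption, since the capture does not show it, and `build_request` and `describe` are illustrative names.

```python
import socket
import struct

# Ethernet/IPv4 ARP payload layout (RFC 826): htype, ptype, hlen, plen, oper,
# sender MAC, sender IP, target MAC, target IP.
ARP_FMT = "!HHBBH6s4s6s4s"


def mac_bytes(mac):
    """Convert 'aa:bb:cc:dd:ee:ff' to 6 raw bytes."""
    return bytes(int(part, 16) for part in mac.split(":"))


def build_request(sender_mac, sender_ip, target_ip):
    """Pack an ARP request payload (no Ethernet header, no padding)."""
    return struct.pack(
        ARP_FMT,
        1,        # htype: Ethernet
        0x0800,   # ptype: IPv4
        6, 4,     # hardware / protocol address lengths
        1,        # oper: request
        mac_bytes(sender_mac),
        socket.inet_aton(sender_ip),
        b"\x00" * 6,  # target MAC (assumed zeros, not visible in the capture)
        socket.inet_aton(target_ip),
    )


def describe(payload):
    """Render the payload roughly the way tcpdump summarizes it."""
    _, _, _, _, oper, _sha, spa, _tha, tpa = struct.unpack(ARP_FMT, payload)
    kind = {1: "Request", 2: "Reply"}.get(oper, f"oper {oper}")
    return f"ARP, {kind} who-has {socket.inet_ntoa(tpa)} tell {socket.inet_ntoa(spa)}"


pkt = build_request("56:ab:3a:05:fc:0a", "77.80.231.34", "77.80.231.34")
print(len(pkt), describe(pkt))
```

Sender and target IP are the same here, matching the "who-has 77.80.231.34 tell 77.80.231.34" in the capture. tcpdump reports length 50, so the frame on the wire carries bytes beyond these 28, presumably padding; the sketch does not model them.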
